EPAA Vol. 12 No. 32 Moss, et al: Interrogating the Generalizability of P... 


http://epaa.asu.edu/epaa/v12n32/


Education Policy Analysis Archives 




A peer-reviewed scholarly journal

Editor: Gene V Glass 
College of Education 
Arizona State University 



Copyright is retained by the first or sole author, who grants right of first
publication to the EDUCATION POLICY ANALYSIS ARCHIVES. EPAA
is a project of the Education Policy Studies Laboratory.

Articles appearing in EPAA are abstracted in the Current Index to 
Journals in Education by the ERIC Clearinghouse on Assessment and 
Evaluation and are permanently archived in Resources in Education. 



Volume 12 Number 32 July 20, 2004 


ISSN 1068-2341 


Interrogating the Generalizability of Portfolio Assessments of 
Beginning Teachers: A Qualitative Study 

Pamela A. Moss 
LeeAnn M. Sutherland 
Laura Haniford 
Renee Miller 
David Johnson 
University of Michigan 

Pamela K. Geist 
Denver, Colorado 

Stephen M. Koziol, Jr.
University of Maryland

Jon R. Star
Michigan State University

Raymond L. Pecheone
Stanford University

Citation: Moss, P.A., Sutherland, L.M., Haniford, L., Miller, R., Johnson, D., Geist, P.K., Koziol,
S.M., Star, J.R., Pecheone, R.L. (2004, July 20). Interrogating the generalizability of portfolio
assessments of beginning teachers: A qualitative study. Education Policy Analysis Archives,
12(32). Retrieved [Date] from http://epaa.asu.edu/epaa/v12n32/.

Abstract 

This qualitative study is intended to illuminate factors that affect 
the generalizability of portfolio assessments of beginning 
teachers. By generalizability, we refer here to the extent to which 
the portfolio assessment supports generalizations from the 
particular evidence reflected in the portfolio to the conception of 
competent teaching reflected in the standards on which the 
assessment is based. Or, more practically, “The key question is,
‘How likely is it that this finding would be reversed or substantially
altered if a second, independent assessment of the same kind
were made?’” (Cronbach, Linn, Brennan, and Haertel, 1997, p. 1).

In addressing this question, we draw on two kinds of evidence 
that are rarely available: comparisons of two different portfolios 
completed by the same teacher in the same year and 
comparisons between a portfolio and a multi-day case study 
(observation and interview completed shortly after portfolio 
submission) intended to parallel the evidence called for in the 
portfolio assessment. Our formative goal is to illuminate issues 
that assessment developers and users can take into account in 
designing assessment systems and appropriately limiting score 
interpretations. (Note 1) 

Introduction 

A growing number of states are using some form of standardized assessment to 
assist in the licensure decisions about beginning teachers. Among the 42 states 
requiring such tests in 2000, the most widely used were paper-and-pencil tests 
assessing varied combinations of basic skills, content knowledge, or 
pedagogical knowledge (NRC, 2001b). The National Research Council's 
"Committee on Assessment and Teacher Quality" concluded that "paper and 
pencil tests provide only some of the information needed to evaluate the 
competencies of teacher candidates" (NRC, 2001b, p. 69). The committee 
called for additional research into the development of licensure systems that 
include assessment of teaching performance. As evidenced in the work of the 
National Board for Professional Teaching Standards (NBPTS), portfolio 
assessment provides one credible means for the large-scale high-stakes 
assessment of teaching performance. The Interstate New Teacher Assessment 
and Support Consortium (INTASC) is building on the pioneering work of the 
NBPTS to develop subject-specific portfolio assessments of beginning teachers. 
Their work provides the basis for this study. 

This qualitative study is intended to illuminate the factors that affect the 
generalizability of this portfolio assessment of beginning teachers. By 
generalizability, we refer here to the extent to which the portfolio assessment 
supports generalizations from the particular evidence reflected in the portfolio to
the conception of competent teaching reflected in the standards on which the 
assessment is based. Or, more practically, “The key question is, ‘How likely is it 
that this finding would be reversed or substantially altered if a second, 
independent assessment of the same kind were made?’” (Cronbach, Linn, 
Brennan, and Haertel, 1997, p. 1). In addressing this question, we draw on two 
kinds of evidence that are rarely available: comparisons of two different 
portfolios completed by the same teacher in the same year and comparisons 
between a portfolio and a multi-day case study (observation and interview 
completed shortly after portfolio submission) intended to parallel the evidence 
called for in the portfolio assessment. The case studies lasted 3 - 5 days, 
depending on each teacher's schedule. Consistent with Cronbach’s (1988,
1989) “strong” program of validity, this study is explicitly disconfirmatory: it is
intended to illuminate potential problems with assumptions about
generalizability. Our formative goal is to raise issues that assessment
developers and users can take into account in designing assessment systems 
and appropriately limiting score interpretations. 

Conceptions of Generalizability 

Messick (1989, 1996) characterized generalizability as "an aspect of construct 
validity" that is meant to "ensure that the score interpretation not be limited to 
the sample of assessed tasks but be generalizable to the construct domain 
more broadly" (1996, p. 250; see also 1989). He noted that generalizability has 
two important senses: (a) "generalizability as reliability ... refers to the 
consistency of performance across the tasks, occasions, and raters of a 
particular assessment which might be quite limited in scope" (p. 250) and (b)
"generalizability as transfer ... refers to the range of tasks that performance on
the assessed tasks is predictive of" (1996a, p. 250). Thus, inferences about the 
broader domain (in our case, competent teaching performance as defined by a 
set of standards) from a particular sample of evidence (as contained in a 
portfolio) can be productively conceived of in at least two distinct steps: from the 
observed performance to the more limited scope of what we will call the 
assessment domain (reliability) and then from the assessment domain to the 
outcome or standards domain (transfer or extrapolation). This distinction 
between kinds or levels of generalization is drawn by others as well, albeit with 
somewhat different language (e.g., Brennan and Johnson, 1995; Haertel, 1985; 
Haertel and Lorie, in press; Kane, Crooks, and Cohen, 1999) (Note 2). 

Within psychometrics, generalizability has typically been evaluated in terms of 
quantitative indicators of reliability or transfer. These concepts from
psychometrics will be useful, even though this is a qualitative study, for helping
us frame and learn from the results of our comparisons. The comparisons we
offer will, in turn, suggest the limitations of conventional theory for illuminating
the complexity of variations involved in teaching practice and making
well-warranted decisions that accommodate that variation.

This first level of inference (reliability) involves generalization from a set of 
representative observations to a well specified assessment domain (or universe 
of generalization) consisting of similar observations (Kane et al., 1999; Brennan, 
2001). We are not simply interested, for instance, in how an examinee 
performed on a particular set of tasks on a particular occasion; rather, we are
interested in estimating how an examinee would perform on tasks/occasions like 
these. Further, we want some assurance that the score is not based on the 
idiosyncrasies of a particular judge but that similarly qualified judges would likely 
interpret the performance in the same way. 

Reliability is appropriately conceptualized and investigated as a faceted concept 
that encompasses multiple sources of “error” or variations over which we want 
to generalize (differences in tasks, raters, occasions, and so on that are 
intended as samples from the same assessment domain). A set of scores can 
have multiple reliabilities and errors of measurement depending on which 
sources of variation are taken into account. The appropriate domain of 
generalization, including the sources of variation over which we want to 
generalize, depends on the decision to be made (Cronbach et al., 1997). For 
those sources of variation over which we want to generalize, empirical studies 
that examine these variations — across tasks, occasions, raters, etc. — are 
required to support the generalization. As Brennan (2001) argued, the notion of 
“replication” is central to an understanding of reliability. Generalizability theory 
(Cronbach, Gleser, Nanda, and Rajaratnam, 1972; Brennan, 1983; Shavelson 
and Webb, 1991) is, perhaps, the most commonly used theoretical model that 
enables the effects of various sources of error to be “disentangled” and 
estimated simultaneously, although other models, especially those based on 
Item Response Theory (IRT) (e.g., Engelhard, 1994, 2002; Myford and Mislevy, 
1995; Wilson and Case, 1997) are becoming more widely used (see Mislevy, 
Wilson, Ercikan, and Chudowsky, 2002, and NRC, 2001a, for a discussion of
alternative models). (Note 3) With generalizability theory, reliability is idealized 
as a statistical generalization based on “random” samples from the assessment 
domain. Brennan (2001) acknowledged that the notion of random sampling is an 
“idealization that is not fully supported”, but noted that “the central conceptual 
distinction is not so much between fixed and random in the literal sense of 
‘random,’ as it is between fixed and ‘not fixed’” (p. 302). Reliability estimates can 
be quite misleading if a facet that varies in the assessment domain (possible 
essay prompts, for instance) is not included in the estimated error of 
measurement. An unfortunate practice is reporting reliability estimates for 
performance assessments based on differences among readers but ignoring 
potential differences among tasks even though the intended generalization is to 
a broader domain of tasks like these. This can seriously overestimate the quality 
of the generalization to the intended assessment domain. 
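
To make the preceding point concrete, the following minimal sketch compares two reliability estimates computed from the same hypothetical scores. It assumes a fully crossed persons x tasks x raters design and the standard relative-error generalizability coefficient. The function name `g_coefficient` and all variance components are illustrative assumptions, not estimates from any study cited here. When tasks are treated as fixed, the person-by-task component moves out of the error term and into the universe score variance, which is how a raters-only estimate can look stronger than the intended generalization warrants.

```python
def g_coefficient(var_p, var_pt, var_pr, var_ptr, n_tasks, n_raters,
                  tasks_random=True):
    """Relative-error generalizability coefficient, crossed p x t x r design.

    var_p   : person (universe score) variance component
    var_pt  : person-by-task interaction component
    var_pr  : person-by-rater interaction component
    var_ptr : residual (person x task x rater, plus unexplained error)
    tasks_random : if False, tasks are treated as fixed, i.e. the
                   generalization is only to other raters of these same tasks.
    """
    # Error attributable to sampling raters (always treated as random here).
    rater_error = var_pr / n_raters + var_ptr / (n_tasks * n_raters)
    # Person-by-task variation, averaged over the tasks actually used.
    task_term = var_pt / n_tasks

    if tasks_random:
        universe = var_p
        error = task_term + rater_error
    else:
        # With tasks fixed, person-by-task variation counts as "true" score.
        universe = var_p + task_term
        error = rater_error
    return universe / (universe + error)


# Hypothetical variance components, chosen only for illustration.
components = dict(var_p=0.30, var_pt=0.40, var_pr=0.02, var_ptr=0.28)

tasks_and_raters = g_coefficient(**components, n_tasks=2, n_raters=2,
                                 tasks_random=True)
raters_only = g_coefficient(**components, n_tasks=2, n_raters=2,
                            tasks_random=False)

print(f"generalizing over tasks and raters: {tasks_and_raters:.2f}")
print(f"generalizing over raters only:      {raters_only:.2f}")
```

With these made-up components, the raters-only coefficient is noticeably higher than the coefficient that also generalizes over tasks, even though the underlying scores are identical.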

The second level of generalization (transfer or extrapolation)
involves generalization from the more limited and carefully specified assessment
domain to a broader outcome domain, which includes the full range of 
performances about which we would like to generalize. As Kane and colleagues 
noted, most educational concepts are quite broad; rarely are we interested 
simply in how examinees perform on other (test) items like these. Using reading 
comprehension as an (often cited) example, the outcome domain of interest 
might include a wide range of types and genres of text (e.g., newspapers, 
magazines, novels, instructional manuals, technical reports, text books, friendly 
letters, business letters, signs, forms, lists, tests), read for a variety of purposes, 
in many different contexts, requiring various kinds and depths of background 
knowledge to understand, with readings represented in multiple ways (writing, 
conversation, mental images or concepts, drawing, marks on answer sheets,
and the like). 

This level of generalization clearly spills over the bounds of reliability into validity 
more generally and typically involves a more tenuously warranted set of 
inferences. Warrants for transfer generalizations include logical or theoretical 
arguments about the relationship between the assessment domain and outcome 
domain. A common approach “is to argue that the skills needed for good 
performances in the universe of generalization (e.g., problem definition, problem 
solving) are essentially the same as, or are a critical subset of, those needed in 
the full target domain” and “that anyone who performs well on the assessment 
should also be able to perform well in the target domain and anyone who 
performs poorly on the assessment should also perform poorly in the target 
domain...” or at least that “the skills being assessed are necessary (if not 
sufficient) for effective performance in the target domain” (Kane et al., 1999, p.
11).

Empirical studies supporting transfer generalizations might involve “criterion 
studies,” examining the relationship between test performance and some
“especially thorough (and representative)” sample from the outcome domain 
(Kane et al., 1999, p. 10) or, more practically, a “series of small experiments 
regressing various outcomes on test performance” (Haertel, 1985, p. 35). Given 
the near infinite range of possible studies, some means of deciding which are 
most important to undertake given limited resources is necessary. As Kane and 
colleagues noted, “in practice, the argument for extrapolation is likely to be a 
negative argument.” 

A serious effort is made to identify differences between the universe 
of generalization and the target domain that would be likely to 
invalidate the extrapolation. If no major differences are found, the 
extrapolation is likely to be accepted. If the impact of some 
differences on the plausibility of extrapolation is unclear, it may be 
necessary to check on their importance empirically. (Kane et al.,
1999, p. 11)

Empirical Evidence of Generalizability 

With Performance Assessments, in General

With performance assessments, the most commonly examined sources of error 
are those due to raters and tasks. Empirical studies of reliability or 
generalizability with performance assessments are quite consistent in their 
conclusions that (a) reader reliability, defined as consistency of evaluation 
across readers on a given task, can reach acceptable levels when carefully 
trained readers evaluate responses to one task at a time, and (b) adequate task 
or "score" reliability, defined as consistency in performances across tasks 
intended to address the same capabilities, is far more difficult to achieve (e.g., 
Breland et al., 1987; Brennan and Johnson, 1995; Dunbar, Koretz, and Hoover, 
1991; Gao and Colton, 1997; Gao, Shavelson, and Baxter, 1994; Lane, Liu, 
Ankenmann, and Stone, 1996; Linn and Burton, 1994; McBee and Barnes, 1998;
Swanson, Norman, and Linn, 1995). In the case of portfolios, where the tasks 
may vary substantially from student to student and where multiple tasks may be 


5 of 70 


7/19/2004 12:06 PM 


EPAA Vol. 12 No. 32 Moss, et al: Interrogating the Generalizability of P... 


http://epaa.asu.edu/epaa/vl2n32/ 


evaluated simultaneously, inter-reader reliability may drop below acceptable 
levels for consequential decisions about individuals or programs (e.g., Koretz, 
McCaffrey, Klein, Bell, and Stecher, 1992; Nystrand, Cohen, and Martinez,
1993). Adequate levels of score (reader and task) reliability have typically been
achieved by further standardizing the task directions, choosing tasks with higher 
intercorrelations, disaggregating the portfolio into separate tasks that can be 
scored one at a time, and then estimating generalizability as one would with any 
collection of performance tasks. Brennan (2001) cautioned that tasks and raters
are only some of the sources of error that are likely to matter. He cited other 
sources of variation that should likely be taken into account. These included 
different occasions, both occasions of testing as well as occasions of scoring, 
and different methods of testing as sources of error. He noted that some of 
these, such as different methods, are better conceptualized as convergent 
validity studies (rather than as reliability studies per se). (Note 4) Of course, 
certain types of estimates are often deemed not feasible, including parallel forms
reliabilities with portfolios and assessments of performance in different contexts 
(ETS, 1998; Harris, 1997; NRC, 2001b; Porter et al., 2003). 

Special studies involving performance assessments have looked at relationships 
among methods of assessment: between multiple choice and performance 
assessment (e.g., Lane et al., 1996; Crehan, 2001); (Note 5) between different 
methods of performance assessment, such as direct observation of scientific 
experiments and analysis of students' notebooks (Shavelson et al., 1991); and
between on-demand and school based tasks (Gentile, 1992, in Brennan and 
Johnson, 1995). The general conclusion is that different methods appear to be 
getting at somewhat different constructs (e.g., Brennan, 2001; Brennan and 
Johnson, 1995). Fewer operational assessments in education undertake this 
sort of empirical research, relying instead on empirical evidence of reliability and 
logical arguments about content-relevance and representativeness. And, 
indeed, while the Standards for Educational and Psychological Testing (AERA,
APA, NCME, 1999) require at least some sort of empirical evidence about
reliability, they mention “external validity” as only one potential source of 
evidence, but leave the choice of validity evidence up to the assessment 
developer and user. 

Some authors note a tradeoff between these two levels of generalization. 
Strengthening the faithfulness with which the assessment represents the 
outcome domain often undermines the reliability of assessment (as reflected in 
the many technical problems with performance assessment) and enhancing 
reliability, for example by employing a larger number of shorter tasks, 
undermines fidelity (e.g., Kane et al., 1999). 

With Teaching Performances, in Particular 

Research into the generalizability of performance assessment of teaching has 
tended to emphasize much the same sort of evidence described above, 
focusing primarily on consistency among tasks and judges. There are two major 
programs of research that are most relevant to our study, the portfolio 
assessments of the National Board for Professional Teaching Standards and 
the observation/interview assessments of Praxis III. Both of these assessments 
are developed by the Educational Testing Service. 




National Board’s standards-based assessments are designed to certify the 
accomplishment of experienced teachers with at least three years of service. 
Assessments are developed or underway for over thirty different certificates 
(differentiated by subject area and age of students taught). The ten performance 
tasks that comprised the assessment in each certificate area (when the 
research described here was undertaken) are divided into two parts: a portfolio 
completed by candidates in their home schools across a year and a one-day 
assessment-center experience. The school based portfolio consists of (a) four 
tasks that ask candidates to document their practice, through videotapes and 
samples of student work, and to provide "extensive analytical and reflective 
commentary" (Pearlman, in Jaeger, 1998, p. 191), and (b) two tasks that ask 
candidates to document their accomplishments outside the classroom and 
explain why they are important. The four assessment-center tasks provide 
candidates with materials such as student work samples, assessment records, 
instructional resource materials, or professional reading and ask them to use the 
materials to diagnose the status of student learning, plan instruction, and so on 
(Pearlman, in Jaeger, 1998, p. 191). (Note 6) Each exercise is scored 
independently by two reviewers. The resulting scores for each exercise are 
weighted and aggregated to form an overall composite score for each 
candidate. This composite is then compared to a predetermined passing score. 
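
The scoring procedure described above can be sketched as follows. The text does not say how the two reviewers' scores are combined, nor what the weights and passing score are, so this sketch assumes a simple average of the two ratings and uses hypothetical exercise names, weights, and passing score throughout.

```python
def composite_score(ratings, weights):
    """Weighted composite across exercises.

    ratings : {exercise_name: (first_reviewer_score, second_reviewer_score)}
    weights : {exercise_name: weight}
    """
    if set(ratings) != set(weights):
        raise ValueError("every exercise needs two ratings and a weight")

    total = 0.0
    for exercise, (first, second) in ratings.items():
        # Assumption: the two independent ratings are averaged.
        exercise_score = (first + second) / 2
        total += weights[exercise] * exercise_score
    return total


def passes(ratings, weights, passing_score):
    """Compare the composite to a predetermined passing score."""
    return composite_score(ratings, weights) >= passing_score


# Hypothetical candidate with three exercises (the real assessment had ten).
ratings = {
    "portfolio_entry_1": (3.0, 3.5),
    "portfolio_entry_2": (2.5, 2.5),
    "center_exercise_1": (3.0, 2.0),
}
weights = {
    "portfolio_entry_1": 0.5,
    "portfolio_entry_2": 0.3,
    "center_exercise_1": 0.2,
}

composite = composite_score(ratings, weights)
decision = passes(ratings, weights, passing_score=2.75)
print(composite, decision)
```

Because the decision rests on a single composite, a strong exercise can compensate for a weak one; any disagreement between reviewers or variation across exercises is folded into that one number before it is compared with the passing score.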

The National Board’s Technical Analysis Report (ETS, 1998) described four 
relevant sources of error: 

• Assessors: Would a candidate, given a different set of assessors, fare 
similarly on the assessment? 

• Exercise Sampling: Would candidates perform similarly on a different set 
(sample) of exercises? 

• Assessment occasions: Would candidates fare similarly if they took the 
same assessment on a different occasion? 

• School context: Would candidates fare similarly if they happened to teach 
in a different school? 

They noted that it is not feasible for them to provide evidence of reliability across 
school contexts or assessment occasions. With assessment occasions, they 
argued that there is likely to be a learning effect such that one would expect a 
candidate to fare differently (better) and so reliability may not feasibly be 
assessed. 

They provided empirical evidence with respect to assessors and exercise 
sampling — concluding that both are adequate to support the assessment for its 
intended use (ETS, 1998, p. 125; see also Myford and Engelhard, 2001). (Note 
7) With respect to exercise sampling, they cautioned readers about the limitation 
of such evidence since the set of tasks was explicitly designed to represent a 
multidimensional domain: 

Whether an assessment with the current design can be considered 
to allow for alternative forms in a traditional measurement sense is 
debatable. It is possible to argue that the exercises are but one 
possible sample from a larger domain of accomplished teaching or 
that the exercises, for all intents and purposes, comprise a fixed
assessment of accomplished teaching. (ETS, 1998, pp. 107-108) 

This is, in fact, typical of the way in which task generalizability is investigated 
with portfolio assessments (e.g., Klein et al., 1995; Koretz et al., 1992; Reckase, 
1995; Nystrand et al., 1993); what we have is an estimate of internal
consistency that is based on tasks designed to access quite different
elements of teaching practice and that treats as fixed a wide range of factors
that may in fact vary. Following Brennan (2001), this is not really a replication
“using two full length operational forms” (p. 313).

With respect to "transfer," Bond, Smith, Baker, and Hattie (2000) examined the 
relationship between scores on the National Board's assessment (in two 
certificate areas for 65 teachers) and 1-3 hour observations of teaching 
accompanied by interviews with teachers and some students. The casebooks 
produced from the visits were scored according to thirteen dimensions of 
accomplished teaching identified in an extensive literature search. Using 
discriminant analysis, they were able to correctly classify 84% of teachers as to 
whether they had been certified using the National Board’s assessment. Other 
studies are currently underway (see www.nbpts.org). While the National Board’s 
goal was primarily documenting consistency across the sources in support of 
the validity of the NBPTS assessment, our purpose is to illuminate both 
similarities and differences at the level of particularity that qualitative methods 
allow. 

ETS’s PRAXIS series, which is intended for use with beginning teachers, 
involves three sets of assessments: PRAXIS I focuses on basic skills, PRAXIS II 
on content knowledge and general pedagogical knowledge, and PRAXIS III on 
teaching performance. The PRAXIS III assessment involves direct observations 
of classroom performance over a series of “assessment cycles.” An assessment 
cycle consists of a preliminary description of the context, the students, and the 
lesson-to-be-observed, prepared by the beginning teacher; an observation of a 
lesson of instruction by a trained assessor (experienced teacher); and pre and 
post semi-structured interviews. The assessor’s notes are then scored on a list 
of nineteen criteria (that were developed through an extensive literature review 
and job analysis survey) and an overall score given. “Summative decisions are 
made based on cumulated data from two or more assessors based on two or 
more assessment cycles” (Dwyer, 1998, p. 8). In addition to the obvious 
differences in methods, PRAXIS III is intended for use across grade levels and 
subject areas, and the criteria for classroom observation have not been tailored 
to particular subject areas as with INTASC and the National Board. Although 
this leads to a somewhat different emphasis, Porter et al. noted the similarity of 
the PRAXIS criteria to the general principles of the National Board and INTASC. 
While there are multiple studies of assessor reliability, there are no reports of 
generalizability across assessment occasions that we could locate (Dwyer,
1998; Myford and Lehman, 1993; NRC, 2001b; Porter, Youngs, and Odden,
2003; Myford, personal communication, 3/5/03; Wylie, personal communication, 
5/2/03). With respect to generalizability across occasions, assessment 
developers caution: 

“The purpose and consequences of the assessment, particular local
circumstances, and the beginning teacher's level of performance
(both absolute and in terms of improvement) are factors that
determine how many assessment cycles will be carried out.
Guidelines governing Praxis validity and use prohibit
decision-making on the basis of a single assessment cycle or on the
judgment of a single assessor (Educational Testing Service, 1993b).”
(Dwyer, 1998, p. 171) 
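
The use guideline quoted above amounts to a simple precondition on the evidence behind a summative decision. The following is a minimal sketch of that precondition only, assuming hypothetical record fields (`cycle_id`, `assessor_id`) and a hypothetical function name; it does not describe any actual ETS system.

```python
from typing import Iterable, NamedTuple


class CycleRecord(NamedTuple):
    """One scored assessment cycle (field names are hypothetical)."""
    cycle_id: str
    assessor_id: str


def supports_summative_decision(records: Iterable[CycleRecord],
                                min_cycles: int = 2,
                                min_assessors: int = 2) -> bool:
    """True only if the evidence spans enough distinct cycles and assessors."""
    records = list(records)
    distinct_cycles = {r.cycle_id for r in records}
    distinct_assessors = {r.assessor_id for r in records}
    return (len(distinct_cycles) >= min_cycles
            and len(distinct_assessors) >= min_assessors)


# Two cycles, but both scored by the same assessor: not sufficient.
same_assessor = [CycleRecord("fall", "A"), CycleRecord("spring", "A")]

# Two cycles scored by two different assessors: sufficient.
two_assessors = [CycleRecord("fall", "A"), CycleRecord("spring", "B")]

print(supports_summative_decision(same_assessor))
print(supports_summative_decision(two_assessors))
```

Note that the check guards only against decisions based on too little evidence; it says nothing about how consistent performance is across the cycles that are observed.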

Thus, the comparisons in the study reported here (which involve full length
replications of portfolio assessments, methods of performance assessment, and
classroom contexts in which the same tasks can be implemented) begin to
address an important gap in our understanding of the generalizability of portfolio 
assessments of teaching and, perhaps, of performance assessments more 
generally. 

Research Design 

Our study draws on qualitative methods to address questions of portfolio 
generalizability through comparative content analyses across different portfolios 
and different methods of assessment for the same teachers. Consistent with 
Kane and colleagues’ (1999) conception of a negative argument, built from a 
serious effort to disconfirm, our goal is to illuminate differences that challenge 
assumptions about generalizability. Where to locate these comparisons in terms 
of the level of generalizability described in the previous section is an open 
question. At face value, one might argue that the portfolio-portfolio comparison 
is a reliability issue (different occasions on which same tasks are performed), 
and the portfolio-case comparison is a transfer issue (different methods and 
different occasions). And yet, as we discuss when we return to this issue after
sharing our findings, the nature of the variations that the different occasions
afford makes this problem far more complex (occasion is confounded with
uncontrollable aspects of context) and raises important questions about the nature of the
assessment domain to which we can appropriately generalize. These are the 
variations that can be invisible when portfolio reliability is examined via 
intercorrelations among tasks and readers. 

We begin with a brief description of the INTASC portfolio assessment system
and then describe data collection for the two comparative studies
(portfolio-portfolio comparison and case-portfolio comparison), which
were replicated in secondary English Language Arts (ELA) and Mathematics
(Math). Since the comparative content analyses for both studies follow a similar 
pattern, we describe those activities in a fourth section. While the data sets are 
small from a quantitative perspective (29 comparative cases across the two 
studies and two subject areas), our goal was to understand each comparative 
case in depth and to illuminate issues for assessment developers and policy 
makers to consider. 

INTASC Portfolio Assessment 

The portfolio assessments are intended for teachers in their first, second, or 
third year of teaching. To guide the portfolio assessment, INTASC has 
developed a set of general and subject specific standards based on INTASC's 
Principles for Beginning Teachers and standards from the relevant professional 


9 of 70 


7/19/2004 12:06 PM 


EPAA Vol. 12 No. 32 Moss, et al: Interrogating the Generalizability of P... 


http://epaa.asu.edu/epaa/vl2n32/ 


communities. The standards and related assessments are intended to provide a 
coherent developmental trajectory with those of the National Board. The 
assessments ask candidates for licensure to prepare a portfolio documenting 
their teaching practice with entries that include: a description of the contexts in 
which they work, goals for student learning with plans for achieving those goals, 
lesson plans, video tapes of actual lessons, assessment activities with samples 
of evaluated student work, and critical analysis and reflection on their teaching 
practices. Unlike the National Board portfolios (which contain four separate 
entries), these entries are organized around one or two units (8-10 hours) of 
instruction such that the portfolio cannot easily be broken into parts for separate 
evaluation. Judges evaluate the portfolios in terms of a series of “guiding 
questions” focused on the portfolio but based on the standards described 
above; they record evidence relevant to each guiding question and develop 
interpretive summaries or “pattern statements” that respond to the question; 
then they determine an overall decision about the candidate (Note 8). As 
developed by INTASC, the portfolios were intended both for professional 
development and for informing decisions about licensure. Of the 10 INTASC 
states that participated in the development of the portfolio assessment, only 
Connecticut is currently using it to inform licensure decisions. 

For this study, participants were recruited from field tests in multiple INTASC
states in 1998-2000. Because our interest in this paper is in the
generalizability of portfolios for licensure decisions, we chose to evaluate the 
portfolios using the guiding questions and decision guide as they were used by 
Connecticut for field tests in 1999-2000, even though the participating teachers 
were recruited from multiple states. As it was implemented in Connecticut in 
2000, there were four possible levels to the overall decision: conditional, basic, 
proficient, and advanced. Judges also completed a "feedback rubric" on which 
they selected performance levels that best characterized the portfolio with 
respect to each guiding question. The assessment occurred as part of a 2-3 
year induction program in which beginning teachers who had an initial 
three-year license were provided with a mentor in the first year and the 
opportunity to attend state-sponsored workshops to prepare them for the 
assessment. When fully operational, teachers who did not pass the portfolio 
assessment in their second year would continue in the program for another 
year. If they did not pass in the third year, they would be required to reapply for 
the initial license after successfully completing additional course work or a
state-approved field placement.

Portfolio-Portfolio Comparison Data Collection 

A small sample of secondary beginning teachers in math (n=7) and ELA (n=6) 
was recruited to complete two portfolios during the same year, choosing
classes and units of instruction that differed as much as possible within their 
routine teaching assignments. They were compensated for the second portfolio. 
Not surprisingly, it was very hard to find beginning teachers willing to assume 
the burden of two portfolios, and it is impossible to fully understand how these 
stalwart volunteers might have differed from their colleagues. We can say that 
their portfolios do reflect a range of performance levels, teaching practices, and 
school contexts and that their paired portfolios do illuminate an instructive array 
of differences, consistent with the goals of the study. 


Case-Portfolio Comparison Data Collection

Another small sample of secondary teachers in math (n=8) and ELA (n=8) was 
asked to allow case studies of their teaching shortly after they submitted their 
portfolios. The sample was recruited to include differences in gender, ethnicity, 
school context, and performance level (based upon a quick read-through by the
portfolio developers). The case studies took place over 3-5 days (depending on 
the teacher’s schedule) during which researchers observed classes; conducted 
entry, exit, and brief daily interviews with the teacher; and interviewed the school 
principal and, if possible, a mentor, regarding the support available to the
teacher. [See Ball, Gere, and Moss, 1998; Moss, Rex, and Geist, 2000a, 2000b 
for fieldwork and case write-up guidelines.] Case study researchers observed 
two classes: the class used in the preparation of the portfolio and a second 
class. As with the portfolio/portfolio comparison, we asked for a class that 
differed as much as possible within each teacher's routine teaching assignment
(but sometimes we were only able to observe a different section of the same 
class). Our intent was to parallel the information collected in the portfolio as 
closely as possible and to gather additional information about the teacher's 
background, school context, and experience preparing the portfolio to address 
additional questions of fairness. Teachers were given a small honorarium for 
participating in the case study. As before, it is not possible to know how these 
volunteers differed from the larger population of beginning teachers. 

Case study researchers, all experienced teachers in the appropriate field, were 
taken through an abbreviated course of study (with practice and feedback) in 
taking fieldnotes and conducting interviews relevant to the project. Tape 
recordings and artifacts were used as back-up. Field and interview notes were 
read by a senior researcher and questions of clarification and elaboration were 
raised to guide revisions (which could be supported with audio-recordings and 
artifacts). Case study researchers were then asked to draw on their notes in 
responding to the Guiding Questions used to evaluate the portfolios. Again, a 
senior researcher reviewed the responses (with fieldnotes at hand) and raised 
questions to facilitate revision. 

Comparative Analyses 

The comparative content analyses for both studies were undertaken in a similar 
fashion. Research assistants (experienced teachers in the content area with 
graduate research training) used the guiding questions (and the dimensions 
contained in the related feedback rubric) to develop a coding scheme for the two 
sources of evidence. Videotapes were roughly scripted for coding. Then, 
answers to each of the guiding questions were developed for each source 
based upon a comprehensive review of the evidence, including the search for 
counter examples to challenge developing interpretations. Similarities and 
differences were then noted, organized by guiding question and overall. 
Justifications for perceived differences in performance level with respect to the 
criteria were developed. For the portfolio-portfolio comparisons in ELA, each 
pair of portfolios was read twice, in reverse order, by two research assistants, 
who then met to develop a consensus on any differences. (Note 9) For the
portfolio-case comparisons and the portfolio-portfolio comparisons in math, a 
single comparison document was developed, and the process was audited by 
another researcher. The comparative content analyses typically took 3-5 days 
per teacher and generated 30-70 pages of text each. These comparisons were 
then condensed into 2-3 page versions that highlighted substantial differences
both at the level of the guiding question and overall. 

It is important to note that we have, for the purposes of this paper, bracketed 
questions about consistency among readers. Elsewhere we address concerns 
about differences in the way knowledgeable readers evaluate portfolios in 
different social settings when trained to reach consistent decisions and when 
allowed to draw on their own criteria of competent teaching (Moss and Schutz, 
2001; Moss, Schutz, Haniford, and Miller, in preparation; Schutz and Moss, in 
press). Here, we present findings whose validity is based upon in-depth 
analyses, in which relevant differences in perspective between readers were 
resolved through consensus-seeking dialogue. The issue for us is not the
validity of a specific score; rather it is the validity of an interpretation of 
difference between two portraits of teaching and an argument for whether the 
observed differences are likely to matter in light of the evaluation criteria. We 
present our evidence for which differences are likely to matter in sufficient detail 
that readers can reconsider these judgments for themselves. 

Structural Differences Between Data Sources and Asymmetrical 
Questions of Comparison 

By structural differences, we mean those differences between data sources that 
could be anticipated in light of the different methods and which are, in fact, 
typically present in our data. With respect to the portfolio-case comparisons, 
beyond the obvious differences in data collection methods, it is important to note 
the following. While we attempted to have case study researchers present on 
days when teaching consistent with what is expected in the portfolio was 
occurring, it was not always possible to observe all the aspects of teaching 
called for in the portfolio. For instance, while the ELA portfolio required evidence 
of students' response to literature and students’ processes in writing, the 
lessons observed in the case study might not cover both areas. The case-based 
evidence is typically weak with respect to formal assessment procedures since 
often no formal assessment was occurring. However, the case study provides 
substantially more evidence about daily classroom interactions. The case also 
provides rich information about the context in which the teacher worked and 
about teaching practices not foregrounded in the portfolio evidence. With 
respect to the portfolio/portfolio comparisons, the portfolio completed second is 
invariably shorter, often considerably so. It typically contains fewer artifacts and
shorter commentary (sometimes with reference back to the first portfolio). This
caused us to develop an asymmetrical comparison and research question:

To what extent does the second portrait (case study or second 
portfolio) cause us to reconsider the evaluation of the teacher's 
performance in (what we'll call) the primary portfolio? 

Findings 


Our comparative analyses were set up to uncover differences in the two 
portraits of beginning teachers and to evaluate whether the differences were 
likely to result in different decisions in light of the INTASC standards (as 
instantiated in the guiding questions and the decision guide as adapted and 
used in Connecticut in 2000). We make no attempt to estimate the frequency 
with which these sorts of differences are occurring; our evidence is not 
appropriate for that purpose. Again, our formative goal is to illuminate issues for 
assessment developers and users to consider in designing an assessment 
system, characterizing the appropriate domains of inference, and limiting 
interpretations appropriately. 

We present our findings in the following sections: We begin with an overview of 
the variations in context of the classes and units selected by these teachers. 
Then, we illustrate our comparison methodology in substantial detail with 
comparisons in both math and ELA. In the first comparison, we provide an 
example of a case in which the differences observed do not seem to matter in 
terms of the relevant criteria (which was, we should note, true in the majority of 
cases). In the second comparison, we provide an example in which the 
evidence in the portfolio is, we’ll argue, ambiguous, because the artifacts 
(videotapes, handouts) only partially support the written representation of the 
class; the case appears to clarify the ambiguity. Whether this is a difference that 
would “matter” depends on how the portfolio readers weigh the partially 
conflicting evidence. Thus this comparison, even more so than the first, 
illustrates some of the interpretive problems we encounter with these sorts of 
data — problems that we have tried to address through far more in-depth 
readings than would be possible in operational use. Then, we present a series 
of briefer vignettes that describe situations in which the second portrait caused 
us to question the conclusions we drew from the primary portrait. 

Contextual Variations 

As indicated above, for both sets of comparisons, we asked teachers to choose 
a second class that differed as much as possible within their routine teaching 
assignments. The classes they selected are presented in Tables 1 and 2 for 
secondary ELA and math teachers respectively. Given their selections, it is 
important to note the many different kinds of (often intersecting) contextual 
variations that are present in the comparisons we examine. These include: 
different sections of the same course (which entail differences in time of day and 
whether the teacher has taught the lesson before); differences in (perceived) 
ability levels and groupings of students, including those designated by the 
school directly (remedial, AP, and the like) or indirectly (scheduling in ELA 
resulting from math assignments) and those perceived by the teacher; different 
courses; different grade levels; different units within the same course; 
differences in (mix of) cultural backgrounds of students; different times of year 
(which involves differences in teachers' knowledge of and relationship to
students); differences in class sizes; differences in availability of curriculum and 
support materials; differences in extent to which these materials are consistent 
with the standards. These are all variations that are fixed for a given portfolio 
assessment of a teacher and are unexamined when all we have is the single set 
of performances from a given class. (Note 10) 


Illustration of Analysis in ELA with a "Complementary" Portfolio/Case 
Comparison 

To illustrate our comparison methodology in ELA, we focus on one 
portfolio/case comparison, "Ms. Bertram" (Note 11), in which the activities we
observed differed substantially, and yet we found the portrait of the teacher 
conveyed in the case study provided quite consistent evidence with respect to 
the general evaluation criteria. We illustrate this comparison in some detail both 
to document our practices of analysis, and to show how two quite different 
activity contexts can nevertheless support similar conclusions about the teacher. 
We begin with a discussion of the ELA portfolio guidelines, the guiding 
questions (developed by INTASC and revised by Connecticut) to evaluate 
completed portfolios, and the way in which we applied them for this study. Then 
we return to the specific case of Ms. Bertram. 

The ELA portfolio handbook asks candidates to complete two distinct entries: 
one each in teaching response to literature (RL) and processes of writing (PW). 
Teachers may choose the same class or different classes for these two 
components. Across these two exhibits, we have the following sources of 
evidence from each teacher: (a) the teacher's rationale for her choice of 
literature and writing assignment, (b) the teacher's daily logs for 10 lessons in 
which she describes the activities she and the students engaged in (providing 
copies of instructional artifacts) and writes brief reflections about how the day’s 
lesson went, (c) videotapes of two to three activities reflecting different
participation structures, (d) the teacher's reflections on the videos, (e) five samples
of student writing, including multiple drafts, with teacher's comments on the 
writing, (f) the teacher's reflections on the students’ writing, and (g) the teacher's 
general reflections on her teaching in the unit. 

In the case study, we have fieldnotes depicting the activities and the discourse 
in the classroom across three days for each of the two classes. Through notes 
from a series of interviews, we learn about the teacher’s goals, her specific 
plans for daily lessons, her reflections on how the lessons went (in general and 
for particular students), and her goals for professional development. 

The guiding questions for ELA (initially prepared by INTASC and revised by 
Connecticut for use in 1999-2000) are organized into four separate categories. 
(1) Questions about literacy focus on "connections among responding, 
interpreting, and composing" with an emphasis on the extent to which students 
develop their own meanings. (2) Questions about instruction focus on how the 
teacher organizes students' learning (including questions about alignment
between goals and instructional strategies, about integration of activities within
and across lessons, and about materials), with an emphasis on the extent to
which instruction provides learning opportunities (challenges) for all students 
and promotes independence. (3) Questions about analysis of learning focus on 
formal and informal assessment of students’ work: how the teacher monitors
students’ progress, communicates with them about their learning, and uses that 
information to inform instruction. (4) Finally, questions about analysis of 
teaching focus on how the teacher reflects on student learning and uses that 
reflection to inform her practice. (Note 12) 


While some of the guiding questions were quite specific and descriptive in 
nature (e.g., "Describe how the teacher helps students use a writing process, 
including context, purposes, and conventions of standard written English."),
others involved much higher levels of inference that required integrating multiple
types of evidence (e.g., "Describe the ways in which the teacher creates a
learning environment that provides all students with opportunities to develop as 
readers, writers, and thinkers"). In analyzing the portfolios, we found ourselves 
following a multi-step process. We began with describing the various sources of 
evidence (e.g., describing the teacher's goals, outlining the progression of 
lessons, scripting the videotape, characterizing the artifacts in terms of the 
nature of students’ responses and any written comments by the teacher; 
illustrating the ways in which teachers reflected on their students' work in their 
commentary). Then we developed interpretations that coordinated various 
sources of evidence (e.g., considering the relationship between the teacher’s 
goals and the progression of lessons to evaluate alignment and scaffolding or 
between the teacher's commentary on the video and what we had observed to 
evaluate quality of reflection). Finally, we moved to the level of responding to 
some of the higher-inference guiding questions (e.g., "Describe how the teacher
uses knowledge about students to meet their needs in instruction and provide 
them with opportunities to learn" or "Describe the ways in which the teacher 
creates a learning environment that provides all students with opportunities to 
develop as readers, writers, and thinkers."). For the case studies, the fieldnotes
from the classroom observation and notes from a series of interviews with the 
teacher allowed us to engage in much the same process. Our task was 
somewhat easier as the case study writer had constructed responses to the 
guiding questions that drew on evidence from the field and interview notes. We
nevertheless reviewed the field and interview notes in light of the case study 
writer’s conclusions and often included the additional detail in our comparisons. 

The Appendix provides brief excerpts from the 70-plus page portfolio/case 
comparison document prepared by LeeAnn Sutherland. It shows brief examples 
of the sort of evidence we have from the two methods and illustrates the way we 
have combined the evidence to develop interpretations and comparisons 
relevant to the guiding questions. Below we offer some general conclusions
based on the comprehensive evidence in the longer document, for which
the Appendix provides only brief examples.

Ms. Bertram teaches sixth grade English Language Arts in a middle school 
located in what the case study writer describes as a small, relatively affluent 
suburban community. In the portfolio and the case, we see a "reading and 
writing" class of 24 students who meet daily for two periods. In addition to this 
reading and writing class, the case study writer observed another section that 
covers only writing. 

For the response to literature exhibit in her portfolio, Ms. Bertram selected a 
series of lessons based on the study of a novella commonly used with this age 
group. Three separate tasks require students to (a) identify the character traits 
of the main characters, (b) compose a written response citing which character 
they felt they were most similar to/could relate to best/liked the best and why, 
and (c) use that reflection as the foundation for creating a simulated journal. In
the processes of writing exhibit, we see Ms. Bertram guide students through the 
development of a poem using metaphors to describe their mothers in 
preparation for Mother's Day.

The case study describes three days of parallel lessons in two classes where 
the teacher focuses on having students select three best pieces of writing from 
their notebooks, complete evaluation sheets about each one, exchange with a 
partner who would name his or her choice for the writer’s best piece on a 
‘Nomination Ballot,’ revise that piece of writing, and publish it on a web page. 

Even though the activities are substantially different, there is nothing in the case 
that would cause us to question our evaluation of the primary portfolio. Both 
portraits show the teacher using a variety of activities to help students use 
literature to make connections, take others’ perspectives, and explore concepts, 
scaffolding their learning through the activities she creates and the discussion 
she guides. She also uses a variety of activity structures (e.g., small group,
whole class). We have ample evidence of similar classroom interaction wherein 
the teacher poses questions to which students respond initially (to begin an 
activity or class session), and she then builds from students’ responses to guide 
subsequent questions, consistently validating their contributions. In both 
portraits, we see the teacher employ a variety of strategies to guide students in 
developing as readers, writers, and thinkers. Either portrait would tell us that this 
is a highly reflective teacher who uses that reflection to shape practice 
immediately and to think about changes in her future practice. She consistently 
addresses both the strengths and weaknesses of each lesson as well as their 
relationship to the larger unit. Thus the evidence in the case reinforces the 
conclusions from the portfolio in somewhat different contexts of teaching. 

Illustration of Analysis in Math with an Ambiguous Portfolio and Clarifying 
Case 

Complex evidence of the sort contained in the portfolio and case often presents 
substantial interpretive problems to readers. While this is not the focus of this 
paper, reading problems do impact the nature of the conclusions we draw. Here 
we illustrate our analytic practices with a math comparison and present a 
situation where the evidence in the portfolio is somewhat ambiguous and where 
the additional evidence in the case appears to support one potential portfolio 
interpretation over the other. 

The math portfolio handbook focuses on a single 8-12 hour unit in 
mathematics and requests artifacts similar to those requested in the ELA handbook.
Here the portfolio contains a description of the classroom context, descriptions 
of a series of lessons with instructional artifacts (e.g., handouts, assignments); 
videotapes, student work, and reflections on two featured lessons; a cumulative 
evaluation of student learning with accompanying reflection; a focus on three 
students across the featured lessons and cumulative evaluation of learning; and 
analysis of teaching and personal growth. As with the ELA handbook, then, we 
have partially independent artifacts (including the videotape, instructional 
artifacts, and samples of students’ work) against which to evaluate (some parts 
of) the teacher’s description and reflection/evaluation on what happened. 


The guiding questions in mathematics are organized into five categories (as 
initially prepared by INTASC and revised by Connecticut for use in 1999-2000). 
(1) Tasks focuses on the appropriateness (variety, richness, challenge, 
accessibility) of the tasks selected by the teacher and on how effectively they 
are implemented (clarity, accuracy, alignment, and responsiveness to students’ 
interests, styles and experiences). (2) Discourse focuses on how effectively the 
teacher orchestrates discourse, uses tools and materials to support discourse, 
and promotes discourse among students in which powerful kinds of thinking 
predominate (defined as students exploring a variety of approaches to problems 
and explaining their reasoning with evidence). (3) Learning environment focuses 
on how effectively the teacher manages the physical, time, and social aspects of 
the classroom and encourages participation and engagement by all students.
(4) Analysis of learning focuses on how effectively the teacher assesses
students’ learning (accuracy, variety, and alignment with objectives and tasks) 
and communicates with students about expectations and feedback. (5) Finally, 
analysis of teaching focuses on how the teacher learns from and improves 
teaching. The comparison methodology was similar to that presented in ELA. 
(Note 13) 

The mathematics teacher in this portfolio/case comparison works at a large 
urban high school, in which 68% of students receive free/reduced lunch. The 
portfolio presents an 11th grade Integrated Geometry course. The teacher, Ms.
Fleming, explains that this course is the lowest level geometry course offered by 
the school and that she closely follows the text. The unit presented in the 
portfolio concerns tessellations and triangles. The case study follows the same 
Integrated Geometry class and a 9th-10th grade Math I class. Ms. Fleming
reports that Math I is the lowest level math course offered by this school with the 
exception of remedial math. At the beginning of the course the students shared 
textbooks with another class; however, Ms. Fleming indicates that these texts 
soon disappeared. Ms. Fleming uses worksheets left by a previous teacher and 
generates her own curriculum worksheets. She reports that approximately 75% 
of Math I students are failing. The lessons presented in the Math I class focus 
on basic arithmetic, naming of geometric objects and measurements. The 
Integrated Geometry lessons observed for the case focus on triangles, angles, 
and parallel lines. The case was conducted late in the year when both classes 
were reviewing material for a final exam. 

We begin with an extended discussion of the portfolio because it alone raises a 
complex interpretive problem when the partially independent artifacts 
(videotapes, handouts) are compared to the teacher’s descriptions of what is 
happening. In the interest of space, we focus on connections across lessons, 
nature of mathematical tasks, and the implementation of tasks in classroom 
discourse. Then we turn to the case where the portrait of the teacher is 
substantially different from what is portrayed in the written portion of the 
portfolio. [Both descriptions draw heavily on Pamela Geist’s extended 
comparison document.] 

The evidence in the portfolio creates a picture of a teacher who sees how the
mathematics of a unit connects across ideas and to prior and later learning. 

Early in the portfolio, Ms. Fleming describes some of the mathematical
connections she believes are important for students to understand. She writes, 

Knowing the properties of triangles is important to the student of 
mathematics because it is the starting point for learning the 
properties for special triangles and enclosed figures, namely 
polygons. For example, the Triangle Angle-sum Theorem can be 
used to derive the sum of the interior angles of a quadrilateral and 
convex polygons. It also lays the foundation for students to learn 
about pyramids and other three dimensional figures. 

Ms. Fleming describes in detail how the seven lessons across the unit connect 
mathematically and what students will learn across the unit to accomplish 
learning goals and objectives. For example, she explains, 

It was important to show the relationship between the exterior angle 
of a triangle with the adjacent interior and remote interior angles. 

Once the properties of a single triangle were established, it was 
necessary to establish the relationship between a pair of congruent 
triangles and how to use the postulates to establish congruence. In 
order to establish this, students had to learn how to make 
congruence correspondence and congruence statement. Of course 
this also leads us to establishing proof, but my department 
recommended not to introduce proofs with this level class. 

In the portfolio, Ms. Fleming develops a strong case for the predominance of 
discovery-type tasks and learning. She explains that hands-on, discovery-type
tasks dominate her practice and that students learn best in these types of 
lessons. She writes, 

The tasks that are most effective are of a ‘discovery’ or ‘hands on’ 
type... [explaining] when my students “see it” and “find it” the 
learning is retained .... I try to let my students have the experience of 
discovery even when it seems small. I have found that using the 
discovery method works best for my students so I have tried to use it 
often... 

Using this method students get to see ideas. For example, when 
students put together the angles from a triangle and actually saw 
that it made a straight line, they knew that adding the interior angles 
of a triangle would equal 180°. They saw that it worked for all 
triangles regardless of the size and shape. 

In the portfolio artifacts, we see a range of tasks including tasks that appear to 
offer opportunities for discovery and those that focus more on recall and 
application of definitions and facts. Consistent with the teacher’s description, a 
series of problems presented in one of the instructional artifacts asks students 
to work at drawing and measuring the various angles within the triangles and 
record their data. The questions that follow ask students to detect a pattern in 
the data and develop statements or conclusions about various relationships 
within the triangles. There is much writing in the portfolio explaining that these 
tasks and others like them are selected because they support students’ 
opportunity to formulate conjectures, reason about mathematical ideas, and 


18 of 70 


7/19/2004 12:06 PM 



EPAA Vol. 12 No. 32 Moss, et al: Interrogating the Generalizability of P... 


http://epaa.asu.edu/epaa/vl2n32/ 


justify results. The teacher also provides evidence of other types of tasks that 
are designed to check on students’ general understanding of geometric shapes 
and their properties. For instance, they ask students to classify shapes, recall 
definitions and theorems, use definitions and properties to find other measures, 
and to justify answers with a known theorem or definition. The tasks appear on 
daily activity worksheets, homework assignments, and on assessments such as 
tests or quizzes. 

We turn next to the videotape to see how these sorts of tasks are implemented 
in classroom interaction. Here, we focus in detail on one of two videotaped
lessons, providing excerpts from our rough transcript of the videotape and Ms.
Fleming’s reflections on what occurred. The videotape is less effective in making 
the teacher’s case for discovery-type learning. We see Ms. Fleming guiding the 
discourse with students responding in short statements that restate a definition 
or fact. The excerpt we’ve selected begins about 4 minutes into the tape after 
Ms. Fleming has finished reviewing, through brief question and answer 
segments, the previous day’s lesson for students. 


T Okay. So let’s look at a triangle (she has an example on the overhead - 
(4:37 into the tape)). We have remote interior angles, we have exterior. We 
have adjacent interior. Let’s look in relationship to this one angle (points to 
image on the screen). Here we have an exterior angle. It’s outside the 
triangle. The adjacent interior is the one that is what? 

S Sharing the same sides. 

T Sharing the same sides. So it’s adjacent. Adjacent means? 

S Next to. 

T Next to, okay? Remote means, we said? 

S Far away. 

T So these two are far away from this exterior angle. Right? These are going
to be your remote interior. 

S (unintelligible) 

T Now look at this, if I said 2 is your exterior angle, what is the adjacent angle
for 2? Where would it be located?

S (A student is asked to come up and point out the specified angle. Other 
students were calling out some helpful comments as well as “I know, I know” 
as he points to various angles the teacher asks him to identify - remote 
interior. She asks him to confirm that he is pointing to remote interior or 
remote exterior. He confirms remote interior). 

T Very good. So depending on which angle you pick, your remote interior 
angles will be switching sides. Okay, this is my exterior angle 1, right (she 
points)? So what angle is adjacent to that and inside the triangle? 


S What’s adjacent? 

T Adjacent means next to, it’s touching. It’s sharing the same side. So angle 
A over here would be CAB. Correct? CAB is adjacent to angle ...? I’m looking 
at angle 1. What is it? 

T I’m looking at angle 1. What kind of angle is angle 1? 

S Exterior 

T What is the adjacent angle to angle 1?

S 4 

T What are the remote interior angles? 

S 5 and 6 

T 5 and 6. Much better. 

T & S (At 7:28 in the tape) (Some students need clarification so students and 
teacher have a brief discussion on the different types of angles and their 
relationships to each other). 

T (Teacher moves around the triangle she has on the overhead and asks for 
students to quickly identify remote, adjacent, and exterior angles). 

T Now, today you’re going to look at the relationships between these angles. 
Okay? I’m going to hand out a worksheet and you’re going to do that. 

[Break in sequence. Students are now working together in small groups on
the assigned worksheet. The teacher walks around to answer questions and
check their work. Students are comparing work. They use rulers and
protractors to measure angles and help each other construct the various
triangles on the worksheet. It’s hard to hear what students are saying to one
another, but the teacher’s voice can be heard from time to time.]


In reflecting on this lesson, Ms. Fleming describes how she interacted with 
students to arrive at a solution: 

I did not offer ‘answers’ for the students, but guided them using 
questions to arrive at a solution. For example, when the male student 
attempted to identify the angles on the transparency, I realized that 
he was trying to ‘bluff’ his way out of it. I guided him by repeating the 
names of the angles, emphasizing the words adjacent and interior. 

And about the small group time, she writes: 

When a student asked me if she measured an acute angle correctly, 
(she did not by reading the protractor incorrectly), I asked her if her 
angle was greater or less than 90°, and if her answer made sense. 


When student A asked me about the measures of her angles, I 
asked her how she could check them. Once she ‘got it’, she 
proceeded to help another member of her group. 

This is not an unreasonable representation of what occurred, and it helps us 
understand why she made some of the choices she did. Viewed in light of this 
interpretive commentary, and taken together with the description of all the lessons
that reflect a privileging of discovery-type learning, it is possible to situate the 
evidence in the videotape within a larger picture that mitigates its dominant 
impression. While not presented here, the other videotape and reflections 
surrounding it raise similar issues; the teacher’s description surrounding the 
lesson creates a different image than what we might infer from the videotape 
alone. 

As teachers, we know that even in the most ‘learner-centered’, 
discovery-oriented classroom, there are often (with good reason) stretches of 
dialogue that resemble what we see here. That we can’t hear what is happening 
among the students on the videotape allows the teacher’s characterization to 
shape our impressions. Viewed alone, this portfolio can be constructed as a 
relatively strong performance, better than just passing, even though the 
evidence provided by the artifacts is a bit uneven. 

Turning to the case study, what we see reinforces what we see in the videotape 
and, taken together with what the case study researcher reports from his 
interviews with the teacher, presents a substantially different portrait. We focus 
on the same aspects of the teacher’s practice, presented in essentially the same 
order: connections across lessons, nature of tasks, and implementation. About 
her characterization of connections across tasks, the case study researcher 
writes: “In the pre-lesson interviews when asked to describe her objectives for 
the next day’s lesson, they were always in terms of discrete topics to be 
covered, sometimes by book chapter. For instance, Ms. Fleming explained her
plans for her Math I class, ‘I am presenting material that is very close to that
they are seeing on the exam. I will do Chapter 10 tomorrow, reading graphs,
finding mean, median, mode, and range.’” Or, following a Geometry lesson, she
says: “they can do triangles, but not the parallel lines. I keep throwing these 
[parallel line problems] at them so they keep seeing them.” In his search for 
counter evidence about this developing pattern, the case study researcher 
offers the following quotation: 

We do introduce the concept of showing how triangles can be 
congruent and we will ask them to give reasons. The last thing they 
were doing was perimeter and area for rectangles, parallelograms, 
and other quadrilaterals. We did Pythagorean Theorem and area 
under the curve using a trapezoid. And we try to reason with them by 
making ‘cubes’ under the curve and having them count the ‘cubes’. 

The case study researcher argues that the teacher merely mentions ideas
that were presented in earlier lessons or other contexts but does not offer
any deeper understanding of how ideas are connected, only that they are.

The case study researcher develops a very different portrait of the dominant 
kinds of tasks offered and the kinds of learning they promote. The case study
writer concludes, “there is little diversity or richness in the problems offered,” 
and, “the majority of tasks are one-step applications of definitions and 
theorems.” He describes, “Both in the integrated geometry and in the geometry 
content of the Math I course, she emphasized fundamental skills such as 
naming and applying simple definitions. In the first period I observed of 
Integrated Geometry the first set of problems all are based on knowing 
definitions (e.g. altitude, median, congruent) or theorems (e.g. corresponding 
angles congruent). The questions are one-step applications of definitions where 
the only probe mentioned in the problem, ‘How do you know?’ is more a
reference to naming the correct specific theorem used to solve the problem.” He 
presents numerous examples of these kinds of tasks. He concludes, “Ms. 
Fleming’s objectives across the tasks she offered were centered around 
coverage of facts, definitions and theorems students had memorized and not on 
the development of particular skills or understandings of broader concepts.” 

The case study writer offers a description of the typical discourse pattern:
“There was one dominant pattern of interaction around the tasks offered in Ms.
Fleming’s classes. My characterization of this pattern is based on three
observations in the context of two different mathematics classes: Integrated
Geometry and Math I. Ms. Fleming offered students a set of problems, similar to 
those given earlier, in the form of a worksheet. Ms. Fleming engaged students in 
a Question-Response-Evaluate type of dialogue around the problems offered on 
the worksheet. The pattern consisted of Ms. Fleming going over the problems 
with the students as a whole class. She would move in order through the 
problems on the worksheet they were currently discussing and for each 
question, the pattern would be essentially the same.” The case study writer 
explains, “As can be seen from the example, Ms. Fleming asks a question, or 
reads a question from the sheet to initiate the conversation; next, a student 
responds to the question with a specific piece of information, either a number, 
theorem name, or yes/no with little or no emphasis on reasoning or justification; 
in the next turn of talk Ms. Fleming evaluates the student’s response, and then 
either gives a correct answer if the student answer is incorrect, or poses a new 
question, which implies the student answer was acceptable.” He notes two 
exceptions to the pattern: (a) a TV game simulation where students are allowed 
to call on one another for help if they aren’t confident of the answer to a 
question and (b) a small group activity where students worked together on a 
problem in groups of three or four where, he notes, the groups often took on 
much the same dynamic as the class overall: students worked on the same 
problems and usually agreed on an answer, which other students in the group 
then copied from those who ‘got it’. 

Which portrait presents the more credible representation of the teacher’s 
practice? What might explain the differences? The difference in the quality of 
representation and reflection could be attributed, in part, to the differences in 
format: spontaneous comments in informal conversations and unprepared 
interviews are unlikely to show the depth of the teacher’s considered reflections. 
And, the written reflections may have been completed with full access to 
curriculum resources and feedback from colleagues. The handbook in fact 
encourages collaborative reflection with colleagues. It’s also important to keep in 
mind that the case study occurred at the very end of the year when the teacher 
was reviewing for the final exam. Does this make the classroom discourse
atypical? Given the evidence, it is impossible to know. [We address the issue of 
ambiguous evidence in more detail in Schutz and Moss (in press).] Whether this 
would count as a difference that matters depends, in part, on how portfolio 
readers cope with the ambiguous evidence in the portfolio. 

Additional Comparative Vignettes 

We examined all 29 of the comparisons in ELA and Math at the level of detail 
described and illustrated in the previous two cases. In this section, we present 
vignettes from five additional comparisons in which the differences we observed 
did seem to matter in terms of the relevant criteria and raise, we argue, 
dilemmas that assessment developers and policy makers should consider in the 
design of assessment systems. 

Consistent with the intent of the paper, our vignettes are developed to 
foreground important differences for a particular comparison; we do not 
describe, as we have above, the similarities in these comparisons. In the 
interests of space, we summarize our conclusions with brief illustrations. [We 
hope the extended examples described above and in the Appendix illustrate the 
attention to detail that underlies these conclusions.] Each vignette follows a 
similar pattern: we first characterize the issues and context differences that the 
vignette raises (so that readers can choose whether to read the vignette) and 
then we provide brief illustrations of those issues. (Note 14) This section 
concludes with a brief mention of additional sorts of differences we noted but 
thought were unlikely to matter in terms of the criteria used. We reserve 
discussion of the issues the vignettes raise until the final section of the paper 
where we propose some possible paths for resolution. 

Vignette 3: Mr. Richards 

In this portfolio/case comparison we see a case in which an English teacher's 
performance looks substantially different in an honors class than in his third 
level class. Mr. Richards teaches in what the case study researcher describes 
as a rural school of about 600 students, 97% of whom are white. Distinguishing 
among students' placements in the school's tracking system, Mr. Richards 
indicated, “Honors kids are chosen because of their work ethic and their
intelligence.” Students in the second level “have the work ethic, but they just
can't grasp the material. They will eventually, but their work ethic keeps their 
nose above water.” For the students in the third level, “the content is watered 
down.” While the portrait of the honors class is relatively consistent across the 
two methodologies, the case study highlights how it is that his beliefs, as well as 
institutional tracking, seem to shape his practice with students in different tracks. 
[The original comparison was prepared by LeeAnn Sutherland.] 

Only the honors class is represented in the portfolio as they complete a poetry 
unit. The case study writer observed the honors class as well as a third level 
class. Mr. Richards explained that the poems for the third level are not as 
difficult; they use a narrative, abridged version of the literature selection from 
their textbook (honors classes read the unabridged version); he uses the 
textbook much more; he has different expectations of students’ writing (the focal 
“correction areas” are different). Mr. Richards’s rationale for his choices of texts,
for the activities he employs to engage students with literature, and for his 
implementation of a writing process appear to differ in terms of his 
understanding of their level of ability. 

As the case study writer describes it, both honors and third level students 
read the same novel at different times during the school year, but Mr. Richards
assessed their interpretive needs and abilities differently. The goal for third level
students was more “the story” and “trying to pick out the basic elements [such
as] plot, theme.” He believed that honors students, however, “can go beyond the
literal.” Honors students had “a lot of discussion ... a lot of note-taking,
explaining the concepts,” whereas third level students answered primarily lower
inference questions on worksheets. Of honors students, Mr. Richards required 
out-of-class reading and book reports that follow a genre sequence — first 
fantasy/science fiction, second historical fiction, and the like. Students in the 
other class did a single book report on a biography or autobiography, and they 
reported on the book by creating a poster or doing an in-character presentation 
to the class. They ended the school year with a novel based on a made-for-TV 
movie which Mr. Richards acknowledged has "absolutely no literary merit" but 
that he chose "because students like it and because it’s reading." Students 
wrote an essay at the end of the unit, a personal narrative that did not require 
them to make connections with the text itself. 

Another example of the difference in opportunities provided to students 
depending on level was in composition study. Honors students prepared for 
10th grade by writing a persuasive essay that included MLA documentation. 
About writing, Mr. Richards said that honors students would be "mortified" to
conference with him individually as "they don’t like to be embarrassed." He
typically worked with small groups of these students in the first semester, he
said, but did not require "rewrites" of them in the second semester because they
had already "mastered the guidelines" of revision. Mr. Richards writes in the
portfolio narrative about two additional, "authentic" tasks honors students would
complete — entries for a poetry contest and composing a group poem to be read 
at graduation. 

In contrast to honors students, third level students met with Mr. Richards for 
individual meetings about their writing. Conferences took place at the front 
center of the room, facing the class, with the teacher seated at a low table and 
the student whose paper was being reviewed seated on a high stool next to him. 
Mr. Richards marked student papers ahead of time so that he would remember 
what to tell them "they need to fix." He "counseled kids" by skimming their 
papers and calling their attention to each item he had marked as problematic. 
The case study researcher observed 15 conferences he held over two days. Mr. 
Richards emphasized form and mechanics in these meetings, including frequent 
references to spelling, contractions, capitalization, use of second-person 
pronouns, writing numbers in word form, and the need to include information in 
its proper place in an essay. Students spoke to answer his questions or to ask 
for clarification of his suggestions. Following those meetings, students were to
"rewrite," which offered two options. Students could "rewrite the entire paper and
make all the corrections" or they could rewrite problem words ten times,
sentences three times, and in addition, write three things that they learned. 
Vocabulary study for students in this class consisted of writing definitions, parts
of speech, and sentences using each word. 

Vignette 4: Mr. Johnson 

Here we have a portfolio-portfolio comparison across two different subject areas 
within mathematics. Both portfolios were generated by a novice middle school 
mathematics teacher working in a community that he describes as white, 
suburban, and blue collar. Mr. Johnson works at a large middle school with 
about 900 students, almost all of whom are native English speakers. The two 
portfolios present 8th grade math courses; both classes use a popular textbook 
series. The more advanced of the two is an Algebra I course for "average" 
students. The unit presented in the portfolio from this class concerns linear 
relationships, particularly the generation of algebraic equations for lines. The 
other portfolio is from a Transitions course for "general ability" students. The unit 
from this course covers statistics, particularly the generation of multiple types of 
graphs to display data. In a close reading of the two portfolios, important 
differences emerge relating to the use of ‘real world’ applications in classroom 
tasks, to modes of final assessment, and to the role of the teacher in classroom 
activities. [The original comparison was prepared by Jon Star.] 

The first category of difference concerns Mr. Johnson’s use of real world 
examples and concrete materials. Connections to real world examples play a 
very prominent role in the tasks in the Transitions portfolio. Several of the 
lessons in this unit begin with students collecting data that is subsequently made 
into a chart. For example, students count Fruit Loops cereal pieces to determine 
which colors occur most frequently; they work with box scores of a basketball 
game; and they cut paper plates in their examination of pie charts. In contrast, 
context plays little or no role in the Algebra portfolio. Students work exclusively 
with symbolic equations of lines: these equations are never given any referent 
or context, nor are any real-life situations embodying linear relationships 
introduced in class. 

A second salient difference concerns the way Mr. Johnson assesses students at 
the end of the portfolio units. In the Algebra portfolio, the teacher assesses 
students in a traditional manner -- using a written test, administered in a single 
class period. Students are asked to complete 23 problems, all clustered around 
the execution of procedures (finding an equation of a line given a point and a 
slope, finding the equation of a line given two points, and converting a line from 
point-slope form to general form). There is significant repetition: for example, 
clusters of four or five problems look identical, with only the numbers changed 
from one to the next. In contrast, the final assessment in the Transitions portfolio
requires students (in groups) to collect data, construct graphs, and give an oral 
presentation. This assessment takes several days to complete; students are 
assessed on the quality and accuracy of their graphs and on their oral 
presentations. At the conclusion of this assessment, students meet individually 
with the teacher to discuss their grade. 

A third difference concerns the role Mr. Johnson appears to take in conducting 
classroom activities. In the Transitions class, the teacher seems to view his role 
as one of a background guide; his actions and his commentary consistently 
indicate that his goal is to largely remove himself from classroom activity. For
example, in one lesson plan, he writes that he plans to "step into the 
background and let students proceed with their work on their own." In another 
lesson, the teacher makes an explicit attempt to re-direct questions posed to 
him back to students (and he subsequently reflects that he was very happy with 
the results). In general, almost all of Mr. Johnson’s lessons in this portfolio
consist of students being given a worksheet or an activity to do in groups; the 
teacher spends much of each class in the background, circulating from group to 
group and answering students' questions when they arise. In contrast, the 
teacher portrayed in the Algebra portfolio is much more directly involved in 
student activity. Almost all lessons in the Algebra portfolio involve the teacher 
conducting a recitation: standing in front of the classroom, he demonstrates a 
procedure, asks frequent questions of the class to guide him through his 
demonstration, and then offers problems that the class should do for practice. 
Although students are involved in these recitations via the teacher's questions, 
the teacher is largely controlling the activity and problem-solving that occurs in 
most classes. The teacher writes that he views the recitation style of instruction 
as appropriate for the more advanced Algebra class but less so for the 
low-achieving students of the Transitions class: "Low achievers and behavior 
disorder students could not stand more than ten minutes of lecture... The style 
that works best for them is more of an activity based learning." 

Vignette 5: Mrs. Martin 

This ELA portfolio-case comparison raises a complex chain of issues: (a) we 
see the same class at two different points in time engaging in substantially 
different learning activities; (b) we learn from the case that some practices 
illustrated in the portfolio were not consistent with the teacher’s practice, were 
undertaken because the portfolio handbook prompted them, and were not 
consistent with what she believed were her students’ needs; and (c) this then 
causes us to question the teacher’s judgments about her students’ capabilities. 
[The original comparison was prepared by LeeAnn Sutherland.] 

The teacher in this portfolio/case comparison works at a school characterized by 
the case study writer as a “large inner city school.” The students who attend this 
school are predominantly Hispanic from poor and working class families. For 
this teacher, Mrs. Martin, both the portfolio and the case study are based on a 
9th grade Writing Enrichment course. The course was developed for students in 
a “transitional program” who are “too old to be in Middle School” at 15-16 years
of age, but are “earmarked as an at-risk group.” There are 13 students in the
class, 10 of whom are bilingual; 3 are identified as special education students. 
The portfolio literature exhibit is comprised of one 4-page short story and one 
poem which is integrated with the writing exhibit. The case study focuses on a 
drama unit with a two-page play from an adolescents’ literary magazine as the 
primary text. The case study writer observed two sections of the same writing 
enrichment course. 

The texts used in the literature section of the portfolio were selected to focus on 
the theme of strong, courageous women. The teacher characterizes her goals 
as helping students see the connection among these pieces of literature and 
their own lives, wanting them to “see the potential within themselves.” The
lessons focusing on these texts, across more than 10 of the 15 days
represented in the portfolio, take students through a series of activities including 
the completion of several charts focusing on elements of literature such as 
character, theme, and imagery, a 2-paragraph “mini-essay” describing one of 
the characters, and a culminating essay in which students write three 
paragraphs which compare a character from the short story with a character 
from the poem. The comparison writer notes that the time spent on these brief 
and straightforward texts seems excessive and that students have little 
opportunity to develop their own ideas. The teacher’s reflections suggest that 
she believes students need this level of support to comprehend the story. She 
indicates that “students have difficulty decoding words” and that each story read
in this course begins with an uninterrupted oral reading (by the teacher) that 
gives the students “the opportunity to hear the story first, get a basic idea of the 
plot of the story, and minimize the frustration of difficult vocabulary.” She notes 
that “focusing on a few skills and then building on them, ensures a complete 
understanding, and more importantly, retention of the lesson. For this group of 
students, retention is the key.” 

In her portfolio, Mrs. Martin provides commentary and videotape of two students 
as they conference about essays they have written. Each of the students offers 
observations to the other, and the author is seen to respond to those 
observations. They discuss thesis statements, paragraphing, use of examples, 
and proper citation form including line numbering [for the poem]. They also 
discuss parts of the essay they found difficult to understand. The comparison 
writer argues that this is one of the better student-student conferences seen in 
portfolios, as participants are actively engaged in dialogue about writing. While it 
is not a substantively rich conference, many of the writing conferences seen on 
videotape are teacher-directed, or the students speak to one another but not 
with one another. The two students seem to “get” the idea of how to engage in a 
writing conference. 

However, when the case study researcher asked the teacher a question from 
the interview protocol, “Did the portfolio involve things that were not part of your 
teaching practice?” Mrs. Martin responded: “I thought the peer editing and peer
responses were phony for me because I don’t do that yet. The kids were not
really ready yet.” If the teacher coached the girls on how to talk for the camera,
then that raises one set of issues. If she did not, but simply asked them to 
conference, even though she usually does not have them do so, then their 
relative success in the conference raises questions about the teacher’s 
judgment of students’ abilities. 

The case study writer saw no writing in response to literature and no process 
writing during the time of his observations. The only writing he saw involved lists 
and definitions associated with vocabulary words. Asked whether what the case 
study writer had observed over the three-day period was “typical of your 
teaching and your classroom,” Mrs. Martin indicated that “the class is a writing 
enrichment class and most of our time for the whole year was spent on 
enrichment.” She stated that previous writing assignments for the course had 
followed the [state’s] test format and students had also written autobiographical 
essays. Neither source of evidence provides examples of these types of writing. 

About literature, Mrs. Martin said that the portfolio requirements, again, did not
jibe with her usual practice: “I think having 7-8 hours of literature did not really fit
the curriculum I have for these kids.” She told the case study writer that these
students are not ready for a novel or for the “7-8 hours of literature” required for 
the portfolio, that they struggle with decoding, and that students needed to read 
the screenplay twice in order to get it. The case study writer observed the 
teaching of the magazine play, and he reports that students’ oral reading over 
two days was relatively fluent, and though little attention was paid to students’ 
understanding of the play, students’ verbal comments, a question one student 
asked, their expressive reading, and other verbal cues indicated that they did, 
indeed, comprehend this particular text as they were reading. Again, this raises 
questions for us about the teacher’s judgment of her students’ capabilities and 
needs, questions that the evidence is insufficient to address.

Vignette 6: Mr. Gere 

In this vignette, we encounter a teacher who indicates that he chose to use his 
portfolio as an opportunity to improve certain areas of his teaching. In the 
portfolio he presents an Algebra I class, where he reports the students reflected 
a wide range of abilities and dispositions. The case study writer observed the 
Algebra I class and a Pre-Calculus honors class. We learn from the case that
Mr. Gere had originally decided to focus on the Pre-Calculus honors class for the
portfolio but switched to the Algebra I class because he felt the class needed 
extra attention. In the portfolio, he indicates that he hoped the portfolio would 
help him focus in on the difficulties he was having and turn them around. [This 
vignette is based on the comparison originally developed by Pamela Geist.] 

While the two portraits of teaching present quite consistent evidence about this 
teacher’s practice, the interpretive commentary that surrounds them leaves the 
reader with a substantially different impression about the effectiveness of his 
teaching. Unlike the situation with Ms. Fleming, where her interpretation 
highlights the strengths in her practice, Mr. Gere focuses, it seems relentlessly, 
on his concerns about his teaching and what his students are learning. The 
case study, then, presents the teacher in a far more positive light. 

The case study writer, acknowledging the predominance of procedural work and 
the ongoing focus on manipulating expressions and equations, nevertheless 
develops a picture of a teacher who encourages students to look at underlying 
ideas and explore some of the logic associated with working the procedures. 

The case study writer concludes, “In all six of the classes observed, the 
students’ oral responses, questions, homework, classwork and quizzes 
indicated that Mr. Gere’s expectations were accessible to most students.” She 
notes some differences between Pre-Calculus and Algebra: For example, the 
case study writer reports that in the algebra class, the pattern was one of fairly 
routine mechanics; first distributing with algebraic expressions, then factoring 
algebraic expressions, and finally solving quadratic equations that were factored 
and set equal to zero. In the pre-calculus class, although the work appeared to 
be quite mechanical, there was more problem solving involved because of the 
number of possibilities when finding equations that fit a set of data points. 
However, she also notes: “As the material developed over the three days, the 
students played a bigger role in the dialogue, offering their own strategies for 
finding the equation from a set of data points. Mr. Gere also used open-ended
questions effectively: for example, about a quadratic equation, in standard form, 
students were asked to give its characteristics, in other words, to tell him what 
they could about this function. The responses were extensive and showed depth 
of knowledge about a quadratic.” The case study writer creates an overall image 
of a fairly successful teacher, one who takes his work seriously, is well-liked and 
respected by his students, and works hard to create a practice that meets his 
goals and expectations. 

The comparison writer notes the similarities between this representation and 
what she sees in the artifacts of the portfolio. The portfolio artifacts show a 
similar continuum of difficulty on daily worksheets, quizzes, and tests. Problems 
begin with simple equations and progress to more complicated ones. Initial 
tasks focus on the procedural steps to solve problems and move toward using 
these steps in context. There usually is one task that requires students to 
explain an idea or the logic underlying steps. In this sense, both reports show 
that tasks become progressively more difficult because they require that 
students know more about the different scenarios represented in algebraic 
equations and how to manipulate more complex expressions and equations. In 
large group work, Mr. Gere demonstrates procedures and talks students 
through his logic of the steps. Mr. Gere asks next-step questions of students 
and students answer Mr. Gere directly. He effectively and accurately
demonstrates the procedures for students, using appropriate mathematical
language and notation to demonstrate how a system is solved, and students
practice and memorize the procedures, eventually making them their own
process. Mr. Gere promotes student-to-student discourse in the context of small
group work and by pairing students together to complete a task. The video
evidence shows that for the most part students work productively in pairs
and small groups, explaining to each other how to proceed with a task and
comparing procedures and answers with each other.

And yet, Mr. Gere’s reflective commentary on his practice paints an entirely 
different image of his success. For instance, talking about the difficulties he 
faced in facilitating discussion, he writes “I regretted not soliciting a variety of 
problem solving methods for this exercise and again bypassed potentially rich 
mathematical discussion in the interests of time. The decomposition of the 
problem’s solution into discrete steps was worthwhile and helpful, but again lost 
something due to the more directed discussion that resulted from my sense of 
time pressure.” He notes further: “My responses to students’ questions also
reflect my impatience such as the response to student A when I don’t even let 
him finish his question before answering. I give him a perfectly accurate and 
reasonable answer, but the tone of impatience is more damaging in other areas. 
Another student question is similar in outcome. I quickly give an accurate 
concise answer to his question but would have benefited the other students with 
the same misunderstanding by instead redirecting his question to a few of the 
weaker students to make sure their understanding was solid.” He worries, “I have
begun to recognize that I have slowly adopted more and more of the students’
inclination to ‘just let me see how to do the problem so I can stop thinking.’” In
fact, he describes what he perceives to be an ongoing decline in students’ 
efforts to succeed in his Algebra I class: “I know that the effort level has declined 
precipitously over the past 1 1/2 months in this class, and I worry that I am 
enabling the very destructive tendencies that are plaguing this class." 


Thus, there is a running theme across his reflections, one that details the 
frustrations and disappointments of not being able to change students’ attitude. 
The image he creates in the portfolio is of a teacher struggling with changing his 
teaching and at times, there is a sense of hopelessness. Because there is little 
offered in the portfolio in the way of a rich analysis for how he intends to turn 
this pattern around, the portfolio writing produces the image of a teacher who 
sees himself as mostly ineffective and struggling with supporting richer 
opportunities in the discourse and at the same time offers few ideas for how to 
change current patterns. In effect, the case study report paints a much more
positive image of the discourse patterns, indicating that procedural goals are
being met through the patterns of discourse and that at times, especially in
Pre-Calculus, the discourse supports a deeper and richer investigation into the
mathematical ideas. While the comparison writer, perhaps cued by the image in
the case study, was able to read behind Mr. Gere’s commentary, this portfolio 
(which was also used in another study involving multiple readers) elicits quite 
different reactions depending on the weight the reader gives the teacher’s 
negative commentary about himself. Whether this is a difference that would 
matter depends on whether or not the portfolio readers are willing and able to 
read behind the teacher’s commentary. 

Vignette 7: Mrs. Jacobson 

In one sense, this portfolio/portfolio comparison provides another example of a 
teacher whose practices look different in classes she characterizes as 
comprised of students with different ability levels. In this case, we observe 
differences in the teacher's demeanor and attitude toward students in the two 
classes. We also note that her expectations, her explanations for her choices, 
and her reflections on students' performance in the second portfolio (unlike the 
primary portfolio) are sometimes framed in terms of cultural and linguistic 
differences. [The primary comparison was prepared by Laura Haniford, drawing 
on documents from Steve Koziol, Leah Kirell, and Suzanne Knight.] 

This teacher, Mrs. Jacobson, submitted portfolios for two 7th grade classes that 
are “theoretically heterogeneous” but that are actually grouped, as she reports, 
based upon the scheduling of math classes. There are 26 students in the first 
class and 28 students in the second class; there appear to be 2-3 students of 
color in the first class while students of color are a majority in the second class. 
The teacher is white. The base text in both classes is a trade novel set during 
World War II and is part of her department’s prescribed curriculum. 

Mrs. Jacobson characterizes her students in the first class as “bright and fun”
and states that her expectations for students reading this novel are that they 
“learn the historical and cultural ramifications of World War II. I intended that 
students examine the personal struggle of the innocent civilians victimized 
during the war and the incredible strength and courage of the survivors.” She 
also states that this particular selection exposed the students to diverse 
perspectives “other than the black/white issue which is pervasive at this school." 
In contrast, Mrs. Jacobson characterizes the students in the second class as 
“behavior problems” and her expectations for them are different. She states that 
she would not have chosen this book for them and that “the majority of these
students cannot-or will not-read it and understand it. These students are 
intensely committed to being Black or Hispanic and did not relate to the 
Holocaust.... They love violence and injustice — most kids their age do.... [But] 
This was far too sanitized for them." Mrs. Jacobson also states that she is more 
concerned that the students in the second class learn the history of WWII as 
opposed to understanding any elements of plot or character. 

The teacher begins each class with a daily oral language (DOL) experience. In 
class one, this takes many forms - open-ended questions about literary terms
followed by a discussion of some examples from the novel and from students’
own experience; brief comprehension questions on the reading; a vocabulary
exercise where extra credit is given for making the teacher laugh; a brief review
of grammatical terms. In class two, the DOL is consistently a recitation/review of
questions on the assigned reading with answers given “swiftly” and written
answers handed in at the end of the week. Commenting on an interaction in 
class one, she writes, “In the future, I might take a hint from this class and 
compare movies they may have seen with the books we’re reading. I always try 
to relate what we’re doing to their own lives, but they like to talk about movies." 
Of class two, she writes “With this group, I have to lead them with a strong 
hand, although I try very hard not to tell them what they ‘should’ say: I want to 
hear what they want to say, even if it is immature or downright silly.” 

In both classes, students’ written response to literature is related to preparing a 
five paragraph theme and to addressing the state’s criteria for a persuasive 
essay. Beyond this, her stated goals for the first class include learning to appeal 
to all five senses, to write "great opening lines" and to "engage" readers; for the 
second class, her goal is getting students to write "something-anything." In the 
first class, the primary assignment asks students to take a position on whether 
they would take in an escaped prisoner of war who came to their home seeking 
refuge. In the second class, the same primary assignment is given, together 
with an alternative prompt related to a reading on the Civil Rights movement, 
because “They identified with the black students.”

In class one, the primary assignment is grouped with two others, a personal 
time narrative and a group diary designed to personalize the story for students; 
there are no surrounding assignments in class two and students in the second 
class are not given the opportunities to work with one another that the students 
in class one are. Samples of writing from each class suggest that students 
understood the demands of the assignment and could respond to it 
appropriately. Commenting on her concerns about a student’s writing in the first 
class, she says: He “does not create an effective visual in the opening 
paragraph. Also... [he] does not respond to the opposition in his fourth 
paragraph. He only states that there is another side." Of one student in the 
second class, she writes: “[He] did not follow the guidelines, either. This child’s 
family speaks Spanish in the home, and he had made great improvements since 
September. He has learned to skip the lines to make paragraphs, and he is 
writing sentences, rather than one long sentence." 

The guidelines for the state's writing tests are the focus of formal writing 
instruction in both classes and of video segments on writing. In the first class, 
Mrs. Jacobson’s introduction is brief, mainly an overview, and students have an
opportunity to look at some samples and to begin working in a writing workshop 
format on their own essays - students read aloud some of their drafts, they work 
together in peer editing, the teacher guides the critique of samples, drawing 
from the student samples to deal with topics in language use (e.g., using 
over-used words), and she confers with individual students. On the video, the 
teacher circulates around the room, talking with individual students about their 
work. Overall, the interactions appear positive and supportive. 

In class two, the teacher guides students through a series of questions about 
the state’s writing test guidelines, seeking responses about what is to go into 
each paragraph and elaborating on student responses. She moves to a whole 
class example - on the topic of what if there were no teachers - which begins to 
generate student responses, although the teacher appears (on video, as well as 
in her comments) to be frustrated that the students don’t seem to understand 
how to give reasons for the “other side," which she says is required by the 
state’s guidelines. There is no small group work: students are in whole class 
activity or working on their own; when they are writing, the teacher circulates 
and has occasional interactions with students. 

Based on our observation of the video, the teacher's management in class one 
appears to be smooth; students move from one type of activity to another and 
from one arrangement to another with little disruption; the teacher comments 
that this group is especially active and noisy, although that doesn’t appear to be 
evident from the tapes. In the second class, management issues dominate more 
of the dialogue. The teacher writes: “running a discussion with this group is like 
walking through hip-deep jello. With every remark comes ambient noise and 
chatter which drowns it out and everything has to be repeated. In fact, as I 
watched this segment, I was bored just listening to myself repeat the instructions 
more than 10 or 20 times. Virtually nothing got accomplished.” In addition to this,
Mrs. Jacobson has several extended disagreements with individual students in 
the second class that are conducted in front of the entire class. 

Mrs. Jacobson’s reflection about her teaching and about students’ learning is 
not detailed or extensive. With class one, she notes that some of her 
assignments were too vague and that she was unprepared for how capable her 
students were, something she would better prepare for in the future. She thinks 
she will add some drama in the future, because the group would have done well 
with this kind of reading and activity. With class two, she notes that she was not 
particularly effective as a teacher, but attributes this primarily to being required 
to teach an inappropriate text and having to follow a district mandate that 
doesn’t fit the students. She notes, “Sadly, many of these students’ real problem
is behavior. If they would listen, if they wanted to produce, they could. Peer 
pressure and stress at home makes it nearly impossible for them to succeed. 
Patience and in-class time to do their work does increase their chances of doing 
acceptable work.” 

Additional Differences 

Substantive differences existed in all the comparisons, as we would expect in 
any dynamic teaching situation. It’s important to note, however, that in the 
majority of cases examined in math and ELA, we found that the second data
source elaborated but did not overturn our general impression of the quality of 
the teacher’s performance with respect to the relevant criteria. In some cases, 
we simply saw the same practices instantiated in a different content; in some 
cases we saw somewhat different practices that, taken together, presented a 
coherent portrait across the two (e.g., Ms. Bertram) or clarified an ambiguity in 
the original portfolio (e.g., Ms. Fleming); in some cases we saw differences 
similar to those we represented here but not so substantial as to overturn our 
judgment of the primary portrait. Portfolios that contained inconsistent 
evidence (in which the artifacts did not fully support the teacher’s descriptions,
as with Ms. Fleming) complicate the question of portfolio generalizability with the
problem of interpreting the initial portrait. [We discuss issues of portfolio 
evaluation elsewhere (Schutz and Moss, in press).] In some cases, we learned 
things about the teacher in the case, which may not have been relevant to the 
criteria, but which shaped our judgment of the teacher. For instance, in one 
case (a likely “conditional” score), the case study researcher observed multiple 
situations of conflict, at least one potentially violent, during and outside of class 
that the teacher skillfully resolved. In fact, we often learned about the teacher’s 
relationship to, rapport with, and work with students outside of class. We also 
learned about numerous factors that influenced the teachers’ performances that 
would not likely be mentioned in the portfolio or illuminated by the criteria if they 
were: the presence or lack of a coherent curriculum and/or text that is consistent 
with the standards; the presence or lack of a supportive mentor in the teacher’s 
subject area; large differences in professional development opportunities and 
opportunities for collaborative work with colleagues; differences in resources 
available to prepare the portfolio (including release time and access to video 
equipment for multiple days). Of course, whether and how these factors 
influencing a teacher’s performance should or even could be fairly taken into 
account in this assessment is an open question. 

Conclusions 

As we indicated in the introduction, our goal is to use these comparisons to 
illuminate issues for assessment developers to consider in designing 
assessment systems. Consequently, our analysis was disconfirmatory: It was 
not intended to document consistency but rather to highlight the kinds of 
differences that can occur across different representations of a teacher’s 
practice and that point to potential problems with implicit assumptions about 
generalizability. We want to caution readers against drawing conclusions about 
the typicality of our comparisons. There is no way to know how these 
volunteers (teachers who were willing to complete a second portfolio or to allow
an observer into their classrooms for 3-5 days) might differ from the larger
population of beginning teachers. However, the dilemmas we have found, which
would not be illuminated in data that are routinely collected, highlight important
issues for educators, assessment developers, psychometricians, and policy 
makers to consider. 

We begin our conclusions with a review of the kinds of differences that seem 
likely to matter (that is, likely to result in different performance levels) in terms of 
the relevant criteria. Then we return to the useful concepts of generalizability 
with which the study was framed. What is the assessment domain (or 
“universe”) to which we can safely generalize? What is the (larger) outcome
domain about which we can reasonably draw inferences supported with logical 
arguments and intermittent empirical studies? How consistent are these 
domains with the domain implied in the decision about licensure? We close with 
some more speculative thoughts about the nature of assessment systems (and 
theoretical resources) that might support well warranted decisions about 
teaching performance. 

What Have We Learned about the Generalizability of Teaching Portfolios? 

The comparisons in this study begin to address an important gap in our 
understanding of the generalizability of portfolio assessments of teaching and, 
perhaps, of performance assessments of teaching more generally. Taken 
together, these vignettes raise a number of concerns, some of which relate 
directly to the topic of generalizability and some of which spill over into concerns 
about validity and ethics. 

In the small set of comparisons we’ve examined here, it is very clear that 
context matters. We’ve shown differences in performance across classes that 
differ in (perceived and/or institutionally designated) ability level of students, in 
subject matter taught, and in cultural background of students. For instance, in 
the case of Mr. Johnson, we saw differences in performance across two subject 
matter domains: statistics and linear equations in algebra. Perhaps it is easier 
for novice teachers to develop “rich and challenging” tasks that foster 
“connections” and “reflect students’ interests, styles and experiences” in some 
domains than in others. We've presented two clear examples of differences in 
performance across classes that differ in perceived ability level (Mr. Richards 
and Mrs. Jacobson). And we found other cases (not described here) in which 
the differences were apparent but far more subtle (as might be seen in the 
difference between the Pre-Calculus and Algebra classes of Mr. Gere). In Mrs. 
Jacobson’s case, perceived differences in ability were coupled with differences 
in the cultural background of her students. The differences in performance here 
are more troubling because of the teacher's apparent attitude toward the 
students and tendency to seek explanations of their performance outside her 
practice, in district requirements and in their perceived needs as members of 
different cultural groups. The rubric has no place for descriptions of teachers' 
expectations and, indeed, if it did, it would be easy to coach a teacher to 
eliminate problematic language from her text. 

That context matters will come as no surprise to those who study classroom 
teaching or performance assessment. There are complex and dynamic 
relationships among teachers’ social backgrounds and experiences; their 
expectations, values, and beliefs; their classroom practices; their students’ 
(inter)actions; and the larger social and institutional structures in which they live 
and work (Gallego, Cole, and the Laboratory of Human Cognition, 2002; Knapp 
and Woolverton, 1995; McLaughlin and Little, 1993; McLaughlin, Talbert, and
Bascia, 1990; McNeill, 1983; Stodolsky and Grossman, 1995). Research in
performance assessment more generally, with tasks that are far narrower in 
scope than those represented in teaching portfolios, shows us that different 
people perform differently on different tasks (the person x task interaction, in
terms of generalizability theory), which necessarily confounds the construct of
interest with variations in the context in which it is performed. A recent review of
approaches to performance assessment in health professions (Swanson et al., 
1995) leads to similar conclusions about the difficulty of generalizing across the 
contexts presented by different tasks. “Regardless of the assessment method 
used, performance in one context (typically, a patient case) does not predict 
performance in other contexts very well” (Swanson, Norman, and Linn, 1995, p. 
8). The social context of a classroom seems even more complex than that of a 
health professional-patient relationship. While both are certainly equally 
embedded in societal and institutional structures, the classroom involves 
dynamic relationships among as many as 30-35 individuals, each with their
own cultural/personal backgrounds that vary in ways we can’t predict. Gallego 
and colleagues (2002) argued that “every continuing social group develops a 
culture and a body of social relations that are peculiar and common to its 
members.... Hence,... we can expect that every classroom will develop its own 
variant” (p. 992). 

Two recent reviews of assessments of teaching (NRC, 2001b; Porter, Youngs, 
and Odden, 2003) both raised concerns about the lack of evidence of teaching 
performance across differing classroom contexts, and our observations support 
those concerns. It is hard to imagine, however, how a single assessment 
program could adequately (and fairly) address those concerns. One could ask 
for samples of teachers' performance in different classroom contexts, as we
tried to do, and yet the range of contexts available within teachers’ yearly class
loads varies quite substantially from teacher to teacher, and all are considerably
narrower than the range of classes and school contexts in which they are 
licensed to teach. (Note 15) One could imagine other kinds of assessments in 
which teachers are presented with cases from a range of classroom contexts, 
and this might provide some relevant evidence; however, asking teachers to 
plan or evaluate activities for students with whom they have little experience 
would raise other kinds of validity questions. And the experience in 
health-related professions with these sorts of simulations suggests that 
questions of generalizability are likely to remain. There are no straightforward 
solutions. 

The case of Ms. Martin raises a second issue directly relevant to 
generalizability. Here we find a teacher who perceives that she is in the position 
of being required to show evidence of a performance that is outside of her 
routine teaching practice. Does that suggest the portfolio guidelines were too 
directive or restrictive? Experience with National Board assessments has led 
developers to conclude that it is important for teachers to understand what is 
valued in the assessment; being explicit about expectations, within the bounds 
of construct relevance, is considered important for validity and fairness 
(Pearlman, in press a, b), and INTASC has emulated their practice. Clearly, 
portfolio assessments of this sort do not support conclusions about what is 
typical. What we learn with a “passing” portfolio is whether a teacher and a 
group of her students can engage in a particular kind of practice and reflection 
in at least one instance. Teachers may, of course, make choices that are not in 
their best interests, as was the case with Mr. Gere, who chose a class with 
which he was struggling and then emphasized his shortcomings. While this is 
commendable and productive for a professional development activity, it is less 
than strategic for a high-stakes assessment. Careful instructions to candidates, 
and examples of successful portfolios, will be important in helping teachers
demonstrate the strength of their practice with respect to the standards. It is 
important to recognize, however, that not all candidates will have commensurate 
opportunities to illustrate their practice. Assessors should try to make sure that 
teachers have the human and material resources they need, including adequate 
time, access to competent mentors, and access to audiovisual services. Of 
course, assessors cannot control teachers’ work assignments or the schools in 
which they work. We have to recognize that these factors influence the extent to 
which teachers can demonstrate a performance consistent with higher scores 
and design a system that is appropriately skeptical of the validity of its 
conclusions about individual teachers. 

Not surprisingly, differences across methods used here also played a role, with 
different methods being more or less adequate in providing evidence relevant to 
different criteria (as we discussed above under structural differences). The 
portfolio typically offered the teachers in our comparison a better opportunity to 
explain their choices and reflect thoughtfully on their teaching (although with a 
skillful interviewer, one could imagine the opposite for some teachers who are 
uncomfortable with writing); the case study provided more evidence (six full 
classes vs. brief videotaped segments from two featured lessons) that allowed 
stronger inferences about the pattern of discourse in class. Of course, either 
method could be revised to better address these concerns. If we want to draw 
conclusions about patterns of classroom discourse, having access to two 
lessons may be insufficient, especially if they do not support the written 
description in the portfolio. Clearly, more research about this would be most 
beneficial. 

While criteria were not varied in this study (and are typically considered fixed), 
there are clearly many different ways to instantiate the INTASC principles in 
specific criteria tied to available evidence. Consider, for instance, the following 
two principles taken from the ten INTASC (1992) principles on which the subject 
specific standards are based: 

Principle #2: The teacher understands how children learn and 
develop, and can provide learning opportunities that support their 
intellectual, social and personal development. 

Principle #5: The teacher uses an understanding of individual and 
group motivation and behavior to create a learning environment that 
encourages positive social interaction, active engagement in 
learning, and self-motivation. (p. 16)

The portfolio assessment situates evidence and criteria relevant to these 
principles within particular subject matter contexts where particular approaches 
to learning are privileged. Alternatively, assessment developers could, as 
PRAXIS III assessments do, frame criteria and evidence more generally. 
Consider the following criteria drawn from PRAXIS III Domain B: 

B1: Creating a climate that promotes fairness

B2: Establishing and maintaining rapport with students

B3: Communicating challenging learning expectations to each student

B4: Establishing and maintaining consistent standards of classroom
behavior

B5: Making the physical environment as safe and conducive to
learning as possible

(Dwyer, 1998, pp. 21-22)

When we have asked INTASC portfolio readers what they would like to attend to 
that isn’t addressed in the rubric, among the issues that repeatedly arise are 
teachers’ relationships with their students and their classroom management. If 
these criteria are given some or substantial weight in a compensatory 
assessment, a number of teachers are likely to improve their scores. In fact, we 
saw one teacher, working in a large urban school, whose ELA performance 
tended to emphasize form over meaning and single correct 
interpretations (consistent with the lowest performance level), and yet the case
study researcher saw multiple examples of the teacher handling potentially
violent conflict among students in ways that successfully defused the incident.
Moreover, the teacher’s reflections on this incident were insightful; he 
commented, for example, on how he learned who he could touch in violent 
situations. This observation is not a criticism of the INTASC criteria and 
standards. As Pearlman (in press) points out, every assessment system has to 
decide what it values and then make those values clear to candidates. Contra 
Brennan (2001), treating rubrics as fixed seems the reasonable choice although 
it’s important to acknowledge that changes in the rubric will likely result in some 
changes in who passes and fails (Moss and Schutz, 1999, 2001). 

While it is beyond the scope of this paper to address, our comparisons also 
raise issues relevant to portfolio readers’ evaluation of the specific evidence 
contained in the portfolio. As the case of Ms. Fleming illustrates, portfolios 
sometimes contain conflicting evidence, especially artifacts that do not fully 
support the written representation. As we argue elsewhere (Schutz and Moss, in 
press), portfolios like these can support different interpretations depending on 
how readers choose to weigh the evidence. Based on recorded dialogue and 
extended interviews, we show how readers who clearly value the same criteria 
and describe the same evidence nevertheless construct a different story about 
the teacher’s practice given the evidence in the portfolio. How should an 
assessment system address ambiguous evidence like this? We return to this 
issue as we discuss implications (and in Schutz and Moss, in press). 

The meaning of central terms like “discussion,” “problem solving,” “inquiry,” 
“reasoning,” and so on is also at issue. If teachers and readers hold different 
meanings for these terms — attach them to different actions/interactions — then 
this can affect readers’ understanding of classroom practice (for which there 
may be no accompanying evidence) and/or readers’ evaluation of the accuracy 
of teachers’ reflections. Further, the issue of slant in the representation of 
performance cannot be ignored. Comparisons among Mr. Gere’s reflections and 
those of the case study and comparison writer illustrate the problem, as does a 
comparison between Mr. Gere’s and Ms. Fleming’s reflections on their practice. 
Mr. Gere must depend on the tenacity of portfolio readers in finding the teaching 
performance behind the negative slant of the reflection. While portfolio readers 
can likely be trained to score situations like these reliably, especially with 
analytic rubrics that might leave the weighting and combining of evidence to the 
predetermined algorithm, this does not make the ambiguity go away; rather, it 
simply masks the problem behind consistent scores that are unlikely to be 
challenged with routine procedures. 

Many of the vignettes raise an important issue that spills over the bounds of 
generalizability: how to evaluate the appropriateness of a teacher's 
practices (the extent to which they are "challenging" for students) based upon 
the evidence contained in the portfolio. Portfolio readers' understanding of 
students' interests, needs, and capabilities depends exclusively on the teachers' 
characterizations and the few artifacts contained in the portfolio (plus readers' 
own mostly unarticulated experience with similar students). How does a reader 
decide whether a teacher's practices are appropriate for her students or 
reflective of inappropriately low expectations perhaps resulting from ability 
groupings imposed by the school? One answer to this question, often heard in 
committee meetings, is that teachers' practices should be evaluated in light of 
their justification of their choices. However, in our experience, writing 
evidence-based justifications or reflections is difficult for many beginning 
teachers. If the quality of the reflection is more crucial to evaluating particular 
kinds of performances, then some teachers may be differentially disadvantaged. 
This leads to the disturbing question of whether it is easier for a teacher to 
receive a higher score with some classes of students than with others. We have 
seen a number of examples in which portfolio readers’ beliefs about the 
appropriateness of different activities lead to different scores on the same 
portfolio. Decisions about appropriateness are often underdetermined by the 
evidence in the portfolio. One could imagine guidelines that ask the teacher to 
provide more evidence of students’ capabilities. In fact, one review panel, noting 
a similar problem, asked for evidence of students' work from more students and 
over a much longer period of time. While this appears to be an appealing 
solution, it risks making the portfolio unmanageable for beginning teachers who 
almost invariably report on the time-consuming nature of the task. Again, there 
are no straightforward solutions to the problem. 

What Can We Conclude about the Generalizability of Teaching Portfolio 
Assessments? 

How might the issues we’ve raised be addressed within the bounds of the 
theoretical resources on generalizability that framed the paper? Returning to the 
questions with which we began this section: What is the assessment domain (or 
“universe”) to which we can safely generalize? What is the (larger) outcome 
domain about which we can reasonably draw inferences supported with logical 
arguments and intermittent empirical studies? How consistent are these 
domains with the domain implied in the decision about licensure? Our answers 
are speculative, based on the evidence provided here and the existing literature 
on performance assessment of teaching. 

What is the assessment domain (or “universe”) to which we can safely 
generalize? 




Readers can certainly be trained to achieve sufficiently reliable scores for 
individual decisions, as the National Board continues to demonstrate. Our 
experience suggests the importance and feasibility of preparing readers to 
forthrightly acknowledge problems of ambiguity in the portfolio evidence. If we 
were rewriting the guiding questions and scoring criteria as a result of this 
experience, we would structure them to encourage careful triangulation across 
different sources of evidence: first, so that higher level inferences could be 
explicitly built up from more descriptive inferences (as we illustrated in our 
comparison methodology) and second, to ferret out disjunctions in the evidence 
that might call inferences about performance levels into question. Assuming that 
this is already enacted informally as part of readers’ training, it could be 
further supported by the formal procedures and documentation through which 
they record their evaluations. The portfolios so identified could be sent for 
additional review and possibly for additional evidence from the beginning 
teacher. While most large-scale assessment systems are set up to deal with 
unscorable responses, responses on which readers disagree, or responses that 
are flagged as atypical in some way (Wilson and Case, 1997; Engelhard, 1994, 
2002), we imagine that what we are suggesting might well lead to a larger 
proportion of responses being identified as needing additional attention, which 
will, in turn, increase the cost of the system. Having additional evidence of 
patterns of classroom discourse and of students’ learning would be useful and 
might reduce the number of portfolios that need additional review and/or 
evidence. Of course, this would increase the time it takes to evaluate each 
portfolio. It might, however, be possible to develop a multi-stage evaluation 
system that only examines the additional evidence (beyond the featured 
lessons, for instance) when questions arise. That said, it is important to note 
that with the portfolio evidence alone, we have no idea what additional factors 
enabled or constrained the performance, and which of those we would consider 
within and outside the construct-relevant bounds of an appropriate resource. If 
the portfolio asked for such information, it is not clear how it could be 
corroborated or fairly taken into account. 
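
The multi-stage idea raised in this paragraph can be made concrete with a minimal sketch. Everything in it is an assumption introduced for illustration: the record fields, the score scale, and the disagreement threshold are hypothetical and do not describe any INTASC or state procedure. It shows only the routing logic, in which a portfolio is sent for additional review when a reader flags a disjunction in the evidence or when readers' scores diverge.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One reader's record for a portfolio (all fields are hypothetical)."""
    reader_id: str
    score: int               # performance level, e.g. 1 (lowest) to 4
    flagged_ambiguous: bool  # reader noted a disjunction in the evidence


def route_portfolio(readings, max_spread=1):
    """Return 'accept' or 'additional_review' for one portfolio.

    A portfolio goes to additional review when any reader flags ambiguous
    evidence, or when reader scores differ by more than max_spread levels.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    if any(r.flagged_ambiguous for r in readings):
        return "additional_review"
    scores = [r.score for r in readings]
    if max(scores) - min(scores) > max_spread:
        return "additional_review"
    return "accept"
```

Under this sketch, the additional evidence beyond the featured lessons would be examined only for portfolios routed to additional review, which is where the added cost discussed above would arise.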

If we know that a teacher and her students can demonstrate certain kinds of 
performances on at least one occasion, how far beyond inferences about the 
particular portfolio might a well warranted assessment domain expand? There 
was nothing in our evidence that suggests it would not be possible to include in 
the assessment domain what a teacher can do in this class (not just on this 
occasion) and possibly, what a teacher can do in other classes (perceived as) 
very much like this one. For those comparisons that involved the same class at 
different points in time or very similar classes, the only differences we found that 
seemed to matter could be explained by ambiguous evidence (e.g., disjunction 
between written portfolio and video) or by knowledge that the teachers felt 
obliged to demonstrate activities that were not part of their routine practice. By 
very similar we mean classes that cover the same content and in which 
teachers’ expectations about the students’ capabilities also appeared to be the 
same. We are careful here to limit the inference to what they can do (and 
whether that is replicable), not to what they typically do. However, research on 
classroom culture, which suggests it is always dynamic and at least partially 
unique (e.g., Gallego et al., 2002) does raise red flags about even this 
assumption that should be empirically investigated. Thus, additional research 
that checks on these assumptions, perhaps with smaller teaching exhibits, or 
with interview/observation cycles like those of PRAXIS III, would be important. If 
feasible, more extended observations would be useful. The advantage of an 
INTASC-like assessment and our more extended case studies is that we see 
how a unit unfolds over a series of lessons. Evidence supporting the 
generalizability of what a teacher can do may well need to include such a series of 
lessons. Thus, we speculate that it is possible to build a logical argument, which 
should be buttressed with periodic empirical studies, about the extent to which we 
can generalize to an assessment domain that includes classes, subject matter, 
and students like these. It may be necessary to build multiple assessment 
opportunities into the assessment system itself to flag candidates whose 
performance is not consistent. Whether these differences should be treated at 
the first (reliability) or second (transfer) level of generalization described above 
is an open empirical question and then a matter of judgment. 

Beyond this, our evidence, taken together with the general lack of positive 
evidence that might overturn it, suggests that we cannot extend the 
well-warranted assessment domain to different classes within a teacher’s 
regular teaching assignments. Many of the differences we’ve encountered 
suggest variations that may only be supportable at the second level of 
generalizability, as a matter of transfer, if even there. Our limited, case-based 
qualitative evidence certainly supports concerns that are raised about score 
generalizability that takes task sampling (which always involves some 
differences in context) into account and, indeed, suggests that the problems 
with portfolio assessments of teaching may be much worse because the 
variations in context are so complex. We simply do not know how a teacher 
working with an honors class might perform with a class designated as remedial 
or how a teacher working with a statistics class might perform if the content 
were substantially different. Here additional research is very much needed — not 
only to help appropriately limit inferences about teaching performance but also 
to understand what changes in a teacher’s work context should involve 
additional professional support. Clearly, the domain implied in the decision to 
license a teacher, typically constrained only by general subject (or subjects for 
elementary teachers) and grade or age levels, is greater than can possibly be 
empirically examined and may, at best, involve weak assumptions and negative 
arguments (not yet disconfirmed) of the sort that Kane and colleagues (1999) 
describe. 

Given the limited assessment domain our study suggests is likely supportable, 
is it worth mounting a portfolio assessment of teaching? We do believe that the 
answer is still yes: If, given adequate time and resources, a teacher is not able 
to demonstrate in one instance a passing performance, then it makes sense to 
require additional opportunities for professional development and further 
demonstration of competence before granting a regular license. This is 
information about teaching performance that would not be available to a state 
using only written tests. But is there more we can expect from a performance 
assessment system? 

Returning to First Principles 

How does a state education authority, charged with ensuring the competence of 
the teaching force, undertake that task in a well warranted way? A recent NRC 
report on Testing Teacher Candidates raised concerns about licensure 
decisions based only on tests of basic skills and content knowledge (which, 
themselves, the report notes, have only limited validity evidence) (Note 16) 
(NRC, 2001b). The authors of the report called, in addition, for: 

Research and development of broad-based indicators of teacher 
competence, not limited to test-based evidence, should be 
undertaken; indicators should include assessments of teaching 
performance in the classroom, of candidates' ability to work 
effectively with students with diverse learning needs and cultural 
backgrounds and in a variety of settings, and of competencies that 
more directly relate to student learning (p. 172). 

Given our conclusions from the previous section and the existing literature on 
performance assessment, this is a tall order. How can information about these 
sorts of performances be reasonably taken into account by distant users? 

When distant users have access to a classroom portfolio like the one we studied 
here, they certainly know something more about a teacher’s practice than they 
did before. The question is how to use that information or, rather, what kind of 
system can feasibly be developed to support a valid and ethical use of that 
information. If we theorize this problem within the bounds of conventional 
approaches to generalizability, then our choices for how to improve the 
assessment system are limited. As Mislevy and colleagues (2002) noted, 
“compromises in theory and methods ... result when we have to gather data to 
meet the constraints of specific models” (p. 49; see also NRC, 2001a): “When 
unexamined standard operating procedures fall short, it is often worth the effort 
to return to first principles” (Mislevy et al., 2003, mss. p. 57). 

In addressing this issue, it is important to illuminate the distinction between (a) 
warranting the validity of the interpretation of a score across individuals with the 
same score (as psychometrics is positioned to do) and (b) warranting the 
validity of a consequential decision about an individual (which may be informed 
by a valid score but typically relies on other/additional kinds of evidence and 
judgments). That these two sorts of warrant can be different is not a radical 
suggestion, even within the discourse of educational and psychological 
measurement. As the testing Standards assert: "In educational settings, a 
decision or characterization that will have major impact on a student should not 
be made on the basis of a single test score. Other relevant information should 
be taken into account if it will enhance the overall validity of the decision" (p. 
146). (Note 17) Citing the example of identifying students with special needs, 
the authors of the Standards note: "It is important, that in addition to test scores, 
other relevant information (e.g., school record, classroom observation, parent 
report) is taken into account by the professionals making the decision" (p. 147). 
And yet, psychometrics has little advice to offer about how to combine such 
evidence into a well warranted interpretation or decision. 

We close, then, by pointing in two somewhat different (and yet potentially 
complementary) directions for returning to first principles to enhance the validity 
of high-stakes assessment of teaching competence: (a) toward the flexible use 
of probability based reasoning illustrated in work of Mislevy, Wilson, and their 
colleagues (Mislevy et al., 2002, 2003; NRC, 2001a; Wilson and Sloane, 2000; 
Wilson, 1994) and (b) toward enhancing the capability of local education 
authorities, in dialogue with the state, to make well warranted and credible 
recommendations about individual teachers. 

Mislevy, Wilson and colleagues argued that fundamental concepts like validity, 
reliability, and fairness, are broader than any particular set of methods for 
addressing them: while “familiar formulas and procedures from test theory” work 
well with “familiar forms of assessment,” (p. 1) they risk constraining new forms 
of assessment that respond to new developments in our understanding of how 
people learn. For instance, citing Brennan’s association of reliability with 
replication (2001), they noted that “it is less straightforward to know just what 
repeating the measurement procedure means if the procedure has several 
steps that could each be done differently ... or if some of the steps can’t be 
repeated at all (if a person learns something by working through a task, a 
second attempt isn’t measuring the same level of knowledge)” (p. 17) (issues 
Brennan acknowledges). They offered a more general characterization of 
reliability as “the evidentiary value that a given ... body of data would provide for 
a claim — more specifically, the amount of information for revising belief about an 
inference” (p. 33). 

They cited a number of alternative theoretical models within and beyond 
psychometrics which, taken together, enhance our capability to model real-world 
situations in reasoning from evidence to inference (Note 18). They noted further 
that each of these might be considered a special case of a more encompassing 
approach to probability-based reasoning that would allow mixing of existing 
models and development of new ones (Note 19). Given the sorts of examples they 
provided, we imagine that models like these would enable distant users to 
combine portfolio-based evidence with other evidence available about the 
teacher, including routinely collected evidence from existing tests and briefer 
embedded assessments that might be collected during a teacher’s preservice 
and induction years. While none of these could be considered interchangeable 
or random samples from the same assessment domain (as common 
approaches to generalizability idealize), each of them would help in decreasing 
uncertainty about a teacher’s competence (or accomplishment). Indeed, a 
consortium of teacher education institutions in California (Performance 
Assessment for California Teachers [PACT]) is exploring the use of preservice 
embedded assessments and induction-year teaching exhibits (similar to the 
INTASC portfolio exhibits) as an alternative to the state’s less-contextualized 
assessment. Mislevy, speaking about assessment of students’ opportunity to 
learn, envisioned that models can be developed which simultaneously take into 
account important features of the context in which the assessment occurs 
(personal communication, 12/28/02). The goal in addressing the qualities of 
reliability and validity is to increase “the fidelity of probability-based models to 
real-world situations” (p. 49). While it is beyond the scope of this paper and our 
collective expertise to proceed much further down this road, we point readers in 
the direction of these scholars’ work to highlight the possibility of developing 
more flexible models with large-scale centralized forms of assessment. 
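
To illustrate only the general idea of "revising belief about an inference" as different kinds of evidence accumulate, the following is a minimal sketch using sequential application of Bayes' rule to a single binary claim. It is not one of the models that Mislevy, Wilson, and colleagues describe; the function name, the likelihood values, and the assumption that the sources are conditionally independent given the claim are all introduced here for illustration.

```python
def update_belief(prior, evidence):
    """Sequentially apply Bayes' rule to a binary claim (e.g. 'competent').

    evidence is a list of (p_obs_if_true, p_obs_if_false) pairs, one per
    observed source. Sources are treated as conditionally independent
    given the claim, which is a simplifying assumption of this sketch.
    """
    belief = prior
    for p_if_true, p_if_false in evidence:
        numerator = p_if_true * belief
        denominator = numerator + p_if_false * (1.0 - belief)
        if denominator == 0:
            raise ValueError("observation impossible under both hypotheses")
        belief = numerator / denominator
    return belief


# Made-up likelihoods for three non-interchangeable kinds of evidence.
sources = [
    (0.80, 0.30),  # passing portfolio
    (0.70, 0.50),  # content-knowledge test (weaker evidence)
    (0.60, 0.40),  # brief embedded assessment
]
posterior = update_belief(0.5, sources)
```

In this toy setting no source is treated as a random sample from a common domain, yet each one shifts the belief, and sources whose likelihoods differ less under the two hypotheses shift it less. A realistic model would also need to represent dependence among sources and features of the teaching context.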

An alternative (and we argue, complementary) direction for returning to first 
principles involves enhancing the capability of local authorities (districts, teacher 
education institutions) to make well warranted (and audited or auditable) 
recommendations to the state about the readiness of individual teachers to 
receive a regular license. Portfolio judgments could be combined with other 
relevant sorts of evidence routinely available only in the local context. This 
approach suggests different roles for state and local agencies, and a different 
use for the portfolio at the state level than at the local level. At the state level, 
the goal would be to audit the practices and judgments about individual 
candidates that are made at the local level. 

To warrant those decisions, we need to move beyond psychometrics or 
(frequentistic) probability-based reasoning (Note 20) and look to other 
epistemological/ethical resources (for instance, in anthropology, hermeneutic 
philosophy, political philosophy and ethics, and the law). The senior author has 
turned to hermeneutic philosophy (for reasoning from evidence to inference) as 
a means of warranting knowledge claims and ethical decisions (Note 21) (Moss, 
1994, 1996, 1998, in press; Moss and Schutz, 2001; Moss, Schutz, and Collins, 
1998). Practices for developing interpretations across disparate sources of 
evidence and controlling readers’ biases can be found in any number of 
“qualitative” methods texts (e.g., Erickson, 1986). 

Of course, when portfolios are constructed and evaluated within a local 
educational community, a somewhat different set of threats (and benefits) to 
validity becomes salient. On the positive side, when candidates have multiple 
opportunities to demonstrate their learning, capabilities, or accomplishments, 
the stakes for any one assessment decision are reduced. This is particularly 
true when support for professional development is provided in between 
assessment episodes. Similarly, when assessors can seek additional 
information to help in explaining the observed performance, as is true in many 
local contexts, the burden placed on interpreting the portfolio evidence is 
reduced. On the negative side, an ongoing relationship between the candidate 
and the readers can detract from validity by allowing potentially irrelevant 
knowledge and commitments to be brought to bear on the conclusions about the 
candidate's performance. It will be important to bring in outside perspectives to 
the evaluation process, so that potentially disabling biases of readers familiar with 
the candidate and context (whether favorable or unfavorable to the candidate) 
can be illuminated and self-consciously considered. Readers designated by the 
state could be invited to participate in the process in a variety of ways: as 
members of the initial portfolio review team; as auditors of the decision, written 
documentation, and supporting evidence produced by the team; or as 
independent reviewers who consider the portfolio-based evidence with no 
knowledge of the outcome. Although consistency in the conclusions of inside 
and outside reviewers will enhance the validity of the decision, high levels of 
consistency may be unlikely because of the differing perspectives and 
knowledge that the different readers bring. Thus, it becomes the state’s role to 
audit and warrant the process at the local level, not necessarily the individual 
decision. Here, the role of the outsider is to illuminate taken-for-granted 
practices and perspectives and make them available for critical review by 
members of the interpretive community so that they may be self-consciously 
affirmed or revised. In that way, the interpretive community continues to evolve 
in its ability to make sound judgments. Alverno College (NRC, 2001b; Zeichner, 
2000) provides one well-documented model of a local system set up to support 
professional development and warrant high stakes decisions. Descriptions of 
other examples of local decision making processes can be found in Porter, 
Youngs and Odden (2003), NRC (2001b), and Lyons (1998). 

We believe both of these approaches (one focused on the warrant for 
centralized decisions and the other on the warrant for local decisions) might 
enhance the way in which evidence of teaching performance can be taken into 
account in licensure decisions. Each has advantages and disadvantages, 
resolving some validity problems and raising others. Whichever approach is 
privileged in a given educational context (that is, whichever approach results in 
the decision that “counts”), the other approach can (and should) provide an 
important check on or challenge to the validity of those decisions. 

In closing, we concur with the National Research Council's (2001b) call for a 
wider range of assessment practices than is typically gathered at the state level, 
including evidence of teaching performance. The work of the National Board and 
of INTASC and Connecticut suggests that portfolios represent one feasible 
means for obtaining information about teachers’ classroom performance. More 
research is needed, however, to unearth potential problems with portfolio or 
other performance-based interpretations and to provoke debate about solutions. 
Further, it is important to note that the dilemmas we have raised do not simply 
reflect technical problems. And the solutions, we believe, are not within the 
bounds of a single assessment program. Rather, teacher education institutions 
and the schools and districts within which teachers work must work together to 
support beginning teachers, especially as they move into new contexts, and to 
ensure that they are ready to provide a productive learning environment for all of 
their students. 


Table 1 

Participating ELA Teachers* and Their Classes 
(with achievement levels as characterized by teachers) 

Portfolio/Case Comparisons 

Teacher | Primary Portfolio Class(es) | Non-Portfolio Class(es) 
Ms. Bertram | 6th grade language arts (“heterogeneous”) | 6th grade language arts (“heterogeneous”) 
Mrs. Carson | 8th grade English (“advanced” “best math students”) | 8th grade English (“best math students”) 
Mr. Koehler | 8th grade ELA (“technically heterogeneous” but mostly “upper level” due to math track) | 8th grade ELA (“heterogeneous”) 
Mrs. Martin | 9th grade writing enrichment | 9th grade writing enrichment 
Mr. Richards | 9th Grade honors | 9th Grade third level (“content is watered down”) 
Mr. Roberts | College English 9 (“average”) | College English 9 (“average”) 
Mr. Roosevelt | 12th grade Humanities II honors and 10th grade “general” | 10th grade honors 
Mr. Turner | 11th grade English III (“college level”) | 9th grade honors 

Portfolio/Portfolio Comparisons 

Teacher | Primary Portfolio Class | Secondary Portfolio Class 
Mrs. Harris | 10th grade English II (“heterogeneous, vocational”) | 10th grade English II (“heterogeneous, vocational”) 
Mrs. Jacobson | 7th grade Language Arts (“heterogeneous,” “phase II math”) | 7th grade Language Arts (“heterogeneous,” “lower ability math”) 
Mrs. Marks | 9th grade World Literature 1 (“top level”) | 10th grade World Literature II (“average track”) 
Ms. Patrick | middle school | middle school 
Ms. Phillips | 7th grade English (“gifted and talented”) | 7th grade English (“B-level”) 
Ms. Snyder | 7th grade English (“many reading 1-2 levels below grade level”) | 7th grade English (“many reading 1-2 levels below grade level”) 

*All names are pseudonyms. 


Table 2 

Participating Mathematics Teachers* and Their Classes 
(with achievement levels as characterized by teachers) 

Portfolio/Case Comparisons 

Teacher | Primary Portfolio Class(es) | Non-Portfolio Class(es) 
Ms. Anderson | Algebra 1, primarily 9th graders | Accelerated Algebra 1, all 9th graders 
Ms. Fleming | Integrated Geometry, 11th graders | Math 1, remedial math (“failing students”) 
Mr. Gere | Algebra 1, 9th graders | Pre-Calculus, predominantly juniors, a few 10th graders 
Mrs. Green | Geometry (academic), 10th, 11th and 12th graders | Integrated Math, remedial math (“socially promoted students”) 
Mrs. Jones | Geometry (college bound), 10th and 11th graders | Algebra 1 (college bound), 9th graders 
Ms. Rinaldi | General 8th grade mathematics | Accelerated Geometry, all 8th grade students (most accelerated math students in the school) 
Mr. Skinner | Pre-Calculus (optional 4th year of mathematics), 12th graders | Algebra II, 11th and 12th grade (students of varying abilities) 
Ms. Weaver | Algebra II (primarily college bound students, some repeating the course), 10th, 11th, and 12th graders | Geometry (majority of the students are college bound, but not honors students), 10th, 11th, and 12th graders 

Portfolio/Portfolio Comparisons 

Teacher | First Portfolio Class | Second Portfolio Class 
Ms. Barnes | Algebra and Geometry, 8th graders (general ability) | Algebra, 8th graders (most advanced math class offered) 
Ms. Eastman | Trigonometry/Analytic Geometry, 11th and 12th graders (general ability, high achieving students) | Consumer Math, 10th, 11th, and 12th graders (many repeating the course or have previously failed a math course) 
Mr. Freeman | Integrated Level 1 (Algebra), 9th graders (advanced) | Geometry, 9th and 10th graders (advanced) 
Mr. Johnson | Transition Math, 8th graders (general ability) | Algebra, 8th graders (average ability) 
Ms. Layton | Transition to College Mathematics, 12th graders (average ability) | Algebra 1 - part 2, 9th, 10th, 11th, and 12th graders (average ability) 
Ms. Schafer | Regular/Inclusion Math, 7th grade | Regular Education Mathematics, 7th grade 
[name missing] | 8th grade mathematics (students of varying abilities, most advanced students leave this classroom to take algebra) | 7th grade mathematics (students of varying abilities) 

*All names are pseudonyms. 


Notes 

• 1. This study was supported, in part, by a grant from the Spencer 
Foundation. We gratefully acknowledge their support. We also wish to 
thank Kevin Basmadjian, Leslie Burns, Leah Kirell, Suzanne Knight, Emily 
Smith, Michigan State University, and Vilma Mesa, University of Michigan, 
for their thoughtful contributions to the comparative analyses. The senior 
author is a member of INTASC’s technical advisory committee (TAC). We 
gratefully acknowledge comments on an earlier draft from Aaron Schutz, 
Mark Wilson and from INTASC staff and TAC members: Mary Diez, Jim 
Gee, Ann Gere, Bob Linn, Jean Miller, David Paradise, and Diana Pullin. 
Opinions, findings, conclusions, and recommendations expressed are 
those of the authors and do not necessarily reflect the views of INTASC, 
its technical advisory committee, or its participating states. 

• 2. Kane and colleagues (1999) actually refer to three levels of inference: 
“inferences from performances to observed scores”, “inferences [or 
‘generalization’] from observed scores to universe scores... which includes 
performances on tasks similar to (i.e., exchangeable with) those in the 
assessment”, and “inferences [or ‘extrapolation’] from universe scores to 
target scores” reflecting “a larger, and generally less well-defined domain” 
where the regulative ideal of random sampling is untenable. Messick’s 
(1994) distinction between task- and construct- based assessments 
seems to parallel Kane’s first two levels of inference. Haertel (1985), too, 
characterizes three levels of generalizability. He differentiates the outcome 
domain into that part that can be empirically investigated and that part that 
involves only weak assumptions. 

• 3. Brennan notes a discontinuity between IRT and other measurement 
models: “It is certainly true that statistics that have a reliability like form 
can be computed based on an IRT analysis, but it is equally true that 
almost all such analyses treat items as fixed. This raises important 
questions about what such statistics mean from the point of view of 
replications of measurement procedures” (2001, p. 304). 

• 4. Brennan also cites scoring rubrics and rater training procedures as 
potentially relevant sources of variation, although most assessments treat 
these (within the universe of generalization) as fixed. 

• 5. There is a long history in the writing assessment literature of examining 
relationships between so-called direct and indirect methods of
assessment, both to document the validity of the multiple choice method 
and to show that actual samples of writing represent a different construct 
than what can be examined with multiple choice tests. 




• 6. Readers should note that the structure of the National Board 
assessment is undergoing revision; the description presented here was 
operative in each of the research studies we describe. See www.nbpts.org 
for updated information. 

• 7. For the six certificates that were operational when their Technical 
Analysis Report was released (1998), the overall estimate of exercise 
reliability (across the 10 tasks) ranged from 0.72 - 0.87, with a median of 
0.825. This included 4 in-class portfolio exercises, two documented 
accomplishments portfolio exercises, and six assessment center 
exercises. For the four in-class portfolio exercises, the reliability ranged 
from .049 - 0.76, with a median of 0.695. [The one math certificate, 
Adolescence and Young Adulthood/Math received the highest exercise
reliabilities and Early Adolescence Generalist, the lowest.] (p. 109). 
Decision consistency for exercise sampling ranged from 5% - 7% for false 
negatives (estimated percent of candidates who incorrectly failed) and 6% 
- 9% for false positives (estimated percent of candidates who incorrectly 
passed). They note that decisions are more consistent for candidates with 
scores further from the cut score. Assessor reliability was generally high 
(.90 - .98) overall and (.85 - .92) on in-class portfolio exercises. “Based on
these analyses of the technical measurement quality of the six certificates 
administered in the 1996-97, the assessments fully meet the requirements 
of the Standards for Educational and Psychological Testing (AERA, APA, 
NCME, 1985) for validity, reliability, and freedom from bias” (p. 125). 

• 8. The use of guiding questions that integrate standards into dimensions 
directed at a particular teaching performance to produce an interpretive 
summary was developed by Genette Delandshere, Steve Koziol, Penny 
Pence, Ray Pecheone, Tony Petrosky, and Bill Thompson in their 
leadership of one of the first two National Board Assessment development 
labs. This has informed both the work at INTASC and NBPTS. See, for 
instance, Delandshere and Petrosky (1994); Koziol, Burns and Brass
(2003). 

• 9. With these independent readings, the rank ordering of the two records
of teaching was the same. Given the time-consuming nature of the task,
we decided it was appropriate to move to a single comparison document 
and audit described next rather than to have two separate documents 
produced. 

• 10. While the four separate classroom based exercises in the National 
Board portfolio may (or may not) encompass some of these variations 
depending on the class(es) the teacher chooses for each exercise, they 
are thoroughly confounded with task differences and not to the best of our 
knowledge routinely examined. 

• 11. All names are pseudonyms.

• 12. Connecticut's ELA guiding questions and rubrics have been revised 
since we used them. The current versions are available on the 
Connecticut Department of Education’s web page at 
http://www.state.ct.us/sde/dtl/t-a/best/portfolio/rubrics.htm.

• 13. Connecticut's Math guiding questions and rubrics have been revised 
since we used them. The current versions are available on the 
Connecticut Department of Education’s web page at 
http://www.state.ct.us/sde/dtl/t-a/best/portfolio/rubrics.htm. 

• 14. Direct quotations from the teachers are in italics. 




• 15. The National Board assessments encourage but do not require 
teachers to choose different classes for different tasks (see guidelines at 
www.nbpts.org). We have not located any documentation that provides
evidence about the ways in which teachers respond to this direction, 
whether/how they are considered during scoring, or how these differences 
might shape the evaluations of their performances. 

• 16. "Little information about the technical soundness of teacher licensure
tests appears in the published literature. Little research exists on the 
extent to which licensure tests identify candidates with the knowledge and 
skills necessary to be minimally competent beginning teachers" (NRC, 
2001b, p. 14). 

• 17. Of course, the import of this statement depends on what your 
conception of validity is. 

• 18. These include item response theory models, latent class models, 
structural equation models, and hierarchical models. 

• 19. “In the same conceptual framework and with the same estimation 
approach, we can carry out probability-based reasoning with all of the 
models we have discussed. Moreover, we can mix and match components 
of these models, and create new ones, to produce models that correspond 
to assessment designs motivated by theory and purpose” (Mislevy et al., 
2001, p. 49). 

• 20. Kadane and Schum (1996), a resource on which Mislevy (1994) 
draws, cite subjective versions of probabilistic reasoning that could be 
used with singular judgments. They use this approach to model the 
likelihood of potential verdicts in a murder trial. They offer this approach as 
a supplement, a way of illuminating assumptions behind, but not a 
replacement to the kinds of human judgments involved in such complex 
social situations. 

• 21. Like psychometrics, hermeneutics characterizes a general approach to 
the interpretation of human products, expressions, or actions. Important 
differences between these disciplines lie, in part, in the ways in which the 
information is combined. Psychometric practices support aggregative 
strategies for combining information: scores for distinct (ideally 
independent) pieces of information are (weighted and) aggregated to form 
an interpretable overall score or grade. Hermeneutics supports a holistic 
and integrative approach to interpretation of human phenomena, which 
seeks to understand the whole in light of its parts, repeatedly testing 
interpretations against the available evidence, until each of the parts can 
be accounted for in a coherent interpretation of the whole (Bleicher, 1980; 
Ormiston and Schrift, 1990; Schmidt, 1995). 

• 22. Quotation marks (“ ”) indicate quotations from the beginning teacher.
Angle brackets (<>) indicate quotations from the case study writer. All
names are pseudonyms. 


References 

AERA, APA, & NCME (1985). Standards for educational and psychological testing. Washington, DC: 
American Psychological Association. 

AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: 
American Educational Research Association. 




Ball, D. L., Gere, A. R. & Moss, P. A. (1998). Fieldwork Guide for the INTASC Beginning Teacher 
Case Study Project. Unpublished manuscript, University of Michigan. 

Bleicher, J. (1980). Contemporary hermeneutics: Hermeneutics as method, philosophy, and critique. 
London: Routledge and Kegan Paul. 

Bond, L., Smith, T., Baker, W.K., & Hattie, J.A. (2000). The certification system of the National Board 
for Professional Teaching Standards: A construct and consequential validity study. Greensboro, 
NC: Center for Educational Research and Evaluation, The University of North Carolina at 
Greensboro. 

Breland, H. M., Camp, R., Jones, R. J., Morris, M. M., & Rock, D. A. (1987). Assessing writing skill 
(Research Monograph No. 11). New York: College Entrance Examination Board. 

Brennan, R. L. (1983). The elements of generalizability theory. Iowa City, IA: American College 
Testing Program. 

Brennan, R. L. (2001). An essay on the history and future of reliability from the perspective of 
replications. Journal of Educational Measurement, 38(4), 295-317. 

Brennan, R., & Johnson, E. (1995). Generalizability of performance assessments. Educational 
Measurement: Issues and Practice, 14(4), 9-12, 27. 

Crehan, K. D. (2001). An investigation of the validity of scores on locally developed performance 
measures in a school assessment program. Educational and Psychological Measurement, 61(5), 
841-848. 

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer (Ed.), Test validity (pp. 
3-17). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. 

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence:
Measurement, theory and public policy (pp. 147-171). Urbana: University of Illinois Press. 

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley. 

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for
performance assessments of student achievement or school effectiveness. Educational and
Psychological Measurement, 57(3), 373-399.

Delandshere, G., & Petrosky, A. R. (1994). Capturing teachers' knowledge. Educational Researcher, 
23(5), 11-18.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of 
performance assessments. Applied Measurement in Education, 4(4), 289-303. 

Dwyer, C. A. (1998). Psychometrics of Praxis III: Classroom performance assessments. Journal of 
Personnel Evaluation in Education, 12(2), 163-187. 

Educational Testing Service (ETS) (1998). NBPTS technical analysis report, 1996-97 administration. 
Southfield, MI: NBPTS.

Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a
many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112.

Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna
(Eds.), Large scale assessment programs for all students (pp. 261-288). Mahwah, NJ: Erlbaum. 

Erickson, F. (1986). Qualitative methods in research on teaching. In M. C. Wittrock (Ed.), Handbook 
of research on teaching (pp. 119-161). New York: Macmillan. 

Gallego, M. A., Cole, M., & The Laboratory of Human Cognition (2002). Classroom cultures and
cultures in classrooms. In V. Richardson (Ed.), Handbook of Research on Teaching (4th Ed.)
(pp. 951-997). Washington: AERA.

Gao, X., & Colton, D. A. (1997). Evaluating measurement precisions of performance assessment with
multiple forms, raters, and tasks. In D. A. Colton (Ed.), Reliability issues with performance
assessments: A collection of papers. Iowa City, American College Testing Program. (ACT
Research Report Series 97-3).

Gao, X., Shavelson, R. J., & Baxter, G. P. (1994). Generalizability of large-scale performance 
assessments in science: Promises and problems. Applied Measurement in Education, 7(4),
323-342. 

Haertel, E. (1985). Construct validity and criterion-referenced testing. Review of Educational 
Research, 55, 23-46.

Haertel, E. H., & Lorie, W. (in press). Validating standards-based test score interpretations.
Measurement: Interdisciplinary Research and Perspectives. 

Harris, D. J. (1997). Using reliabilities to make decisions. In D. A. Colton (Ed.), Reliability issues with
performance assessments: A collection of papers. Iowa City, American College Testing Program. 
(ACT Research Report Series 97-3). 

Interstate New Teacher Assessment and Support Consortium. (1992). Model standards for beginning 
teacher licensing and development: A Resource for State Dialogue. Washington, DC: Council of 
Chief State School Officers. 

Jaeger, R. M. (1998). Evaluating the psychometric qualities of the National Board for Professional 
Teaching Standards' Assessments: A methodological accounting. Journal of Personnel 
Evaluation in Education, 12(2), 189-210.

Kadane, J. B., & Schum, D. A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence.
New York: Wiley.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational
Measurement: Issues and Practice, 18(2), 5-17. 

Knapp, M., & Woolverton, S. (1995). Social class and schooling. In James Banks & Cherry A. McGee
Banks (Eds.), Handbook of Research on Multicultural Education (pp. 548-569). New York: Simon 
and Schuster. 

Klein, S. P., McCaffrey, D., Stecher, B., & Koretz, D. (1995). The reliability of mathematics portfolio 
scores: Lessons from the Vermont experience. Applied Measurement in Education, 8(3),
243-260. 

Koretz, D., McCaffrey, D., Klein, S., Bell, R., & Stecher, B. (1992). The reliability of scores from the 
1992 Vermont Portfolio Assessment Program: Interim report. Santa Monica, CA: Rand Institute
on Education and Training, National Center for Research on Evaluation, Standards, and Student 
Testing. 

Koretz, D., Klein, S., McCaffrey, D., & Stecher, B. (1993). Interim report: The reliability of Vermont 
portfolio scores in the 1992-93 school year (RAND, RP-260). Santa Monica, CA: RAND.
(Reprinted from CSE Technical Report 370, Los Angeles, University of California, Center for 
Research on Evaluation, Standards, and Student Testing, December.) 

Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (in press). The Vermont Portfolio Assessment 
Program: Findings and implications. Educational Measurement: Issues and Practice, 13(3), 5-16. 

Koretz, D., Stecher, B., Klein, S., McCaffrey, D., & Deibert, E. (1993). Can portfolios assess student
performance and influence instruction? The 1991-92 Vermont experience (RAND, RP-259).
Santa Monica, CA: RAND. (Reprinted from CSE Technical Report 371, Los Angeles, University
of California, Center for Research on Evaluation, Standards, and Student Testing, December.)

Koziol, S. M., Jr., Burns, L., & Brass, J. (2003). Four lenses for the analysis of teaching. Supporting
beginning teachers’ practice. Working paper, Michigan State University. 

Lane, S., Liu, M., Ankenmann, R. D., & Stone, C. A. (1996). Generalizability and validity of a
mathematics performance assessment. Journal of Educational Measurement, 33(1), 71-92. 

Linn, R. L. & Burton, E. (1994). Performance-based assessment: Implications of task specificity. 
Educational Measurement: Issues and Practice, 13(1), 5-8, 15. 




Lyons, N. (1998). With portfolios in hand: Validating the new teacher professionalism. New York:
Teachers College Press. 

McBee, M. M., & Barnes, L. L. (1998). The generalizability of a performance assessment measuring
achievement in eighth-grade mathematics. Applied Measurement in Education, 11(2), 179-194.

McLaughlin, M., Talbert, J., & Bascia, N. (Eds.). (1990). The contexts of teaching in secondary 
schools: teachers' realities. New York: Teachers College Press. 

McLaughlin, M. & Little, J. W. (Eds.). (1993). Teachers' work: Individuals, colleagues, and contexts. 
New York: Teachers College Press. 

McNeil, L. (1983). Contradictions of control: School structure and school knowledge. New York: 
Routledge and K. Paul. 

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New 
York: American Council on Education/Macmillan. 

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance 
assessments. Educational Researcher, 23(2), 13-23.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256. 

Mislevy, R.J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 
439-483. 

Mislevy, R. J., Almond, R., & Steinberg, L. (2003). On the structure of educational assessment. 
Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.

Mislevy, R. J., Wilson, M. R., Ercikan, K., & Chudowsky, N. (2002). Psychometric principles in student
assessment. Los Angeles: CRESST. 

Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for 
performance assessment. Review of Educational Research, 62(3), 229-258.

Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12. 

Moss, P. A. (1996). Enlarging the dialogue in educational measurement: Voices from interpretive 
research traditions. Educational Researcher, 25(1), 20-28, 43.

Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues 
and Practice, 17(2), 5-12.

Moss, P. A. (in press). The meaning and consequences of reliability. Journal of Educational and 
Behavioral Statistics. 

Moss, P. A., Rex, L., & Geist, P. (2000a). Case Study Writing Guide for the INTASC Beginning 
Teacher Case Study Project. Unpublished manuscript, University of Michigan. 

Moss, P. A., Rex, L., & Geist, P. (2000b). Fieldwork Guide for the INTASC Beginning Teacher Case 
Study Project. Unpublished manuscript, University of Michigan. 

Moss, P. A. & Schutz, A. (1999). Risking frankness in educational assessment. Phi Delta Kappan, 
80(9), 680-687. 

Moss, P. A. & Schutz, A. (2001). Educational standards, assessment, and the search for 
"consensus". American Educational Research Journal, 38(1), 37-70.

Moss, P. A., Schutz, A. M., & Collins, K. M. (1998). An integrative approach to portfolio evaluation for
teacher licensure. Journal of Personnel Evaluation in Education, 12(2), 139-161.

Moss, P. A., Schutz, A. M., Haniford, L., Miller, R., & Coggshall, J. (in preparation). High stakes 
assessment as ethical decision making. Unpublished manuscript, University of Michigan. 

Myford, C. M. (1993). Formative studies of Praxis III: Classroom Performance Assessments-An
overview. The Praxis Series: Professional Assessments for Beginning Teachers. Princeton, NJ:
Educational Testing Service.

Myford, C. M., & Engelhard, G. (2001). Examining the psychometric quality of the national board for 
professional teaching standards early childhood/generalist assessment system. Journal of
Personnel Evaluation in Education, 15(4), 253-285.

Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system 
(Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing 
Service. 

National Research Council (2001a). Knowing what students know: The science and design of 
educational assessment. Washington, D.C.: National Academy Press. 

National Research Council (2001b). Testing teacher candidates: The role of licensure tests in 
improving teacher quality. Washington, D.C.: National Academy Press. 

Nystrand, M., Cohen, A. S., & Martinez, D. (1993). Addressing reliability problems in the portfolio 
assessment of college writing. Educational Assessment, 1(1), 53-70. 

Ormiston, G. L., & Schrift, A. D. (Eds.) (1990). The hermeneutic tradition: From Ast to Ricoeur.
Albany: SUNY Press.

Pearlman, M. (in press a). The design architecture of NBPTS certification assessments. In L. 
Ingvarson (Ed.), Assessing teachers for professional certification. Stamford, CT: Jai.

Pearlman, M. (in press b). The evolution of the scoring system for NBPTS assessments. In L. 
Ingvarson (Ed.), Assessing teachers for professional certification. Stamford, CT: Jai. 

Porter, A. C., Youngs, P., & Odden, A. (2003). Advances in teacher assessments and their uses. In V.
Richardson (Ed.), Handbook on Research on Teaching (pp. 259-297). Washington, DC: AERA. 

Reckase, M. D. (1995). Portfolio assessment: A theoretical estimate of score reliability. Educational 
Measurement: Issues and Practice, 14(1), 12-14, 31. 

Schmidt, L. K. (1995). Introduction: Between certainty and relativism. In L. K. Schmidt (Ed.), The
specter of relativism: Truth, dialogue, and phronesis in philosophical hermeneutics (pp. 1-22). 
Evanston, IL: Northwestern University Press. 

Schutz, A., & Moss, P. A. (in press). Reasonable decisions in portfolio assessment. Education Policy
Analysis Archives. 

Shavelson, R. J., Baxter, G. P., Pine, J., Yure, J., Goldman, S. R., & Smith, B. (1991). Alternative
assessment technologies for large scale science assessment: Instrument of education reform. 
School effectiveness and school improvement, 2(2), 97-114. 

Shavelson, R. J., & Webb, N. W. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE 
Publications. 

Stodolsky, S. & Grossman, P. (1995). The impact of subject matter on curricular activity: An analysis 
of five academic subjects. American Educational Research Journal, 32(2), 227-49. 

Swanson, D., Norman, G. R. & Linn, R. L. (1995). Performance-based assessment: Lessons from the 
health professions. Educational Researcher, 24(5), 5-11, 35.

Wilson, M. (1994). Community of judgment: A teacher-centered approach to educational
accountability. In Office of Technology Assessment (Ed.), Issues in Educational Accountability.
Washington, D.C.: Office of Technology Assessment, United States Congress.

Wilson, M., & Case, H. (1997). An examination of variation in rater severity over time: A study in rater 
drift. Berkeley, CA: Berkeley Evaluation and Assessment Research (BEAR) Center. 

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. 
Applied Measurement in Education, 13(2), 181-208. 

Zeichner, K. (2000). Alverno College. In L. Darling Hammond (Ed.), Studies in excellence in teacher 
education: Preparation in the undergraduate years. Washington, DC: American Association of 
Colleges of Teacher Education.




About the Authors 

Pamela A. Moss is an Associate Professor in the School of Education at the 
University of Michigan. Her areas of specialization are at the intersections of 
educational assessment, validity theory, and interpretive social science. She can 
be reached at 4220 School of Education, University of Michigan, Ann Arbor, MI
48109-1259 (pamoss@umich.edu). 

LeeAnn M. Sutherland is an Assistant Research Scientist at the University of 
Michigan (lsutherl@umich.edu). Her work focuses on adolescent literacy and 
identity, particularly as students make sense of school discourse vis-a-vis their 
everyday experiences. 

Laura Haniford is a doctoral candidate in Educational Foundations and Policy 
at the University of Michigan (lhanifor@umich.edu). She specializes in teacher 
education, especially multicultural education. Her research focuses on the ways 
in which classroom discourse influences learning opportunities. 

Renee Miller is a doctoral candidate in the School of Education at the University 
of Michigan (reneelm@umich.edu). She specializes in Science Education and 
Museum Studies. 

David Johnson is a Ph.D. student in Educational Studies at the University of 
Michigan (djjohnso@umich.edu). His research interests include the influence of 
government policy on teachers and students and how students make meaning 
of their state-mandated testing experiences. 

Pamela K. Geist is an educational consultant in Denver, CO 
(pamgeist@TEG-Global.com). She specializes in mathematics education. 

Stephen M. Koziol, Jr. is Professor and Chair of the Department of Curriculum 
and Instruction at the University of Maryland (skoziol@umd.edu). He specializes 
in English Education, program design and policy in teacher education, and 
teacher assessment. 

Jon R. Star is an Assistant Professor in the College of Education at Michigan 
State University (jonstar@msu.edu). His research focuses on students' learning
of middle and secondary school mathematics, particularly the development of 
mathematical understanding in algebra. 

Raymond L. Pecheone is an Academic Research and Program Officer in the 
School of Education at Stanford University (raymond.pecheone@stanford.edu). 
He specializes in the design and implementation of complex performance 
assessment systems. Previously Dr. Pecheone was Bureau Chief of Curriculum, 
Research, Testing and Assessment for the Connecticut State Department of 
Education. He was co-director of one of the first assessment development labs 
for the National Board for Professional Teaching Standards. 

Appendix 




Excerpts From 70 Page Document Comparing Portfolio And Case For "Ms. 
Bertram" (prepared by LeeAnn M. Sutherland). 

Note: We have emphasized excerpts from the exhibits that focus on writing. 
Consequently, readers may not find examples of evidence for reading in the 
conclusions which address writing and reading together. (Note 22) 


Portfolio 

Case 

Teacher's goals for the unit 

[based on T's commentary] 

The T states that the unit we see 
in the pf, based on the study of a 
popular age appropriate novella, 
has three goals. She writes that 
Ss will be expected to: 

• Identify the character traits of the 
main characters in the novella, 

• Compose a written response 
citing which character they felt 
they were most similar to/could 
relate to best/liked the best and 
why, and 

• Use that reflection as the 
foundation for creating a 
simulated journal to record
thoughts and reflections upon the 
events of the plot as viewed 
through that character’s eyes, to 
be evaluated at the end of the 
novel by the teacher and a 
pre-discussed rubric (TR, 2). 

[based on interview notes and fieldnotes] 

According to the CSW, <one of this 
teacher’s goals centered around the Ss 
assessment of their own work .... She 
asked that Ss review all of the entries 
written in their notebooks throughout
the entire school year and using selection 
criteria, determine the piece which would 
serve as their best entry> (CSW, 25). 

The 3-day series of lessons observed by 
the CSW were focused on having Ss 
select 3 best pieces of their own writing 
from their notebooks, complete 
evaluation sheets about each one, 
exchange with a partner who would 
name his or her choice for the writer’s 
best piece, revision of that piece of 
writing, and publication on a web page. 
This is their choice to say “I have grown
as a writer and I have matured. This is
the piece out of all of the others that I am
proud to call my own.” (CSW, FN 12). 

Teacher's characterization of students 

Based on T's commentary

The T says: “In a previous unit on
biography, I had realized that this
particular group of Ss was both
willing and capable of ‘allying’
themselves with characters whose
gender, life experiences, or age
were different than their own, if
the Ss perceived a connection
based on personality or
motivation” (5). Thus she “hoped
that Ss would choose to further
this exploration – that S diaries
might cross over the lines of age
and gender perspectives,
encouraging discussion and
deeper reflection into the theme of
the book” (5).

She says this class is “fearless in
its class discussions, and the
whole-class discussion forum
works well – rather than a
teacher-led review, these
discussions often erupt into lives
of their own. Students
frequently – though
politely – challenge one another’s
ideas” (14, L).

Based on interview notes

The T describes her 1st period class as
“the sleeper class” (CSW, 8) and <my
little puppies coming in every shape and
size and personality. I wish I knew them
as people . . . They don’t fight over
grades, uncertainties or anything, but
rather sit there and intimate, ‘Let’s just
get on with it’ > (CSW, 9). She describes
her 5/6 period as the “hoopla class,” one
she calls a <challenge class for me>
(CSW, 8).

She describes period 2/3, the portfolio
class, as <the group I can always count
on. They have the ability to do whatever
well> (CSW, 9). They are “a class I never
have to worry about. They are wonderful,
enthusiastic, we will do anything for you
today, Ms. Bertram ...Their personalities
are like a prism... period 1 tend to be a
little variegated and textual ...whereas
period 2 kids come up and share their
lives with me” (CSW, 10). “Period two is
the class that I’ll toss an idea out to and
all of a sudden they’re coming up with
ideas. Boom, Boom, Boom . . . It’s their
ideas” (CSW, 11).


The T generalizes about this 
class, “[Although] this class is, 
overall, a group very much at 
ease with higher-order thinking 
skills, there are a number of more 
concrete Ss who require 
questions of a more literal nature” 
(26). 


During the first pre-observation interview, 
the T talked with the CSW about how she 
anticipated Ss <[responding] to her 
instructional delivery the following day.> 
She differentiated between 1st period 
and 2/3rd period (the pf class): “Period 
one students will find evaluating 
themselves as difficult. They think they 
know what to look for, but ... these 
children often require lots more modeling 
before they fully understand how to 
evaluate. They will also struggle with the 
technical aspects of this assignment. 
Some are still struggling with using 
laptops finding it difficult to SAVE their 
work or to save their work to a disk. 
Picking the entry itself will be very easy 
for these children, but the chosen entry 
may not even be their best piece. 
However, period two students will find the 
lesson a challenge, yet a breeze. Most of 
them are technologically literate and 
experience few technical difficulties using 
the laptops. They will be quite 
enthusiastic about the assignment as 
well as honest. These children know 
what to look for when selecting their best 
piece of work and are most capable of 
knowing the right answers whatever the 
lesson.” 


Chronological summary of activities 




[based on T's daily log]
Session 1

SSR: Sustained Silent Reading
(10 min). The T introduced the
fantasy unit by having Ss
brainstorm as a class “What
elements would we expect to find
in a fantasy novel?” The class
then discussed conflict in fantasy
and the problem-solving role of
the hero. In small groups, Ss
brainstormed words and phrases
that describe a hero, and they
used crayons to sketch an image
of a hero. These responses were
shared with the whole group. The
T encouraged discussion by
asking Ss to elaborate; for
example, when a S offered: “The
hero may not have planned to be
a hero,” the T asked the class,
“What does this mean?” to
encourage discussion. S
homework was to write one
paragraph describing “The Perfect
Hero” (L, 6-7).

Session 2

SSR (10 min.) The class began
with Ss sharing their homework
about the perfect hero aloud.
Afterward, the T guided Ss into
identifying those characteristics
they have listed as “traits,” and
talking about differences between
physical and nonphysical
character traits. Ss then played a
game (10 min.) in which they
circulated around the room to get
their peers’ signatures on a
handout that asked for information
about both the physical and the
nonphysical traits of their
classmates. The worksheet
required Ss to learn about 24
topics including who owns a
kitten, who sings in the shower,
and who has freckles? (ART, 9A).


[based on fieldnotes]
Day One

Class begins with Ss writing for about 10 
minutes in their notebooks. <She then 
begins the lesson with, “Who can start 
the review of yesterday’s lesson?” A 
student responds that they did a 
notebook share. She continues with, 

“And what’s the purpose of sharing your 
notebook entries with others?”> (CSW, 

FN 15). T does a power point 
presentation to walk Ss through the steps 
of choosing their “best piece” for 
polishing. Discussion ensues with Ss 
volunteering strategies for choosing a 
best piece, reasons why they might 
consider one “best” and why they might 
“pass” on others. Power point 
presentation includes a sample entry 
which is discussed as a whole class. 
Homework is a “Nomination Ballot” which 
Ss are to use to review the 3 entries they 
had chosen for the previous night’s 
homework. <”You’re going to check off 
the strong points, not so strong points 
and comment on whey these three 
entries could be the entry of the year”> 
(CSW, FN 20). Because the T realized 
after 1st period that the evaluation 
handout was going to be homework, she 
took more time in the 2nd period (pf 
class) such as elaborating more on the 
concepts> as the class worked through 
the power point presentation and 
discussion (CSW, 28). 

Day Two 

Class began with 10-minute writing in 
notebooks followed by a discussion of 
what Ss found easy and found difficult in 
last night’s homework. The T noticed that 
several Ss had incomplete assignments, 
so she reviewed the concept of 
“reflection” on the best piece, answering 
“Why” it could be a winner. 

Day Three 


57 of 70 


7/19/2004 12:06 PM 


EPAA Vol. 12 No. 32 Moss, et al: Interrogating the Generalizability of P... 


http://epaa.asu.edu/epaa/vl2n32/ 


When finished, Ss shared what 
they had learned about their 
classmates and discussed “which 
character traits convey the most 
information about a person or 
character” (L, 9). 

[...sessions 3-8 described....] 

Session 9 

T read-aloud continued (10 min.) 
Following this, Ss received 
instructions and did individual 
seatwork in preparation for the 
following day’s Goldfish Bowl 
activity. Ss were to answer 4 
questions on a handout, and the T 
circulated to answer questions 
and keep students “on task.” 
Questions include: “Do the main 
characters fit the criteria of a 
hero — at least, the hero we 
discussed in our first lesson? Why 
or why not?” Ss were also asked 
a prediction question, a question 
about changing the novel’s point 
of view, and a question relating 
faults as sometimes helpful to a 
particular character in the 
novel — to how faults could be 
helpful in real life (23a). The last 
15 minutes of the period were 
devoted to small group 
discussion, assigning a single to 
each member, revising individual 
responses based on other group 
members’ responses, and 
preparing to share those the 
following day (22-23, 23a, L). 

Session 10 — the videotaped 
session 

(No oral reading.) Spent the entire 
class period on the Goldfish Bowl 
activity. A “Reflection Sheet” 

(ART, 25A) asked Ss to respond 
to 3 questions about each of the 
larger questions discussed in the 
bowl. The worksheet asked, 


Class began with 10-min writing in 
notebooks and review of what they have 
done thus far in regard to the notebooks. 
The next step is revising, which the T 
walked Ss through using a Power Point 
presentation. They were also given a 
handout to guide them. 

A difference was noted in the way 
homework was assigned to both classes. 
Period one was to finish the teacher 
created ditto, while period two was to 
complete not only the ditto but revisions 
on their best entry of the year as well. 
The teacher reported that it was a 
mistake on her part, realizing that period 
one should have finished their revisions 
as well 


58 of 70 


7/19/2004 12:06 PM 


EPAA Vol. 12 No. 32 Moss, et al: Interrogating the Generalizability of P... 


http://epaa.asu.edu/epaa/vl2n32/ 


“Which part of the discussion 
most closely matched your own 
answer? How?” and “Did you 
disagree with or not understand 
some part of the discussion? 

Which part? Why?” and “Do you 
feel the ‘goldfish’ left anything 
out? If yes, what?” The T 
indicates that this activity closed 
“the pre-vacation leg of the unit” 

(25, L). 

Comparison: [[In both portraits, we see the T provide a variety of ELA
experiences for these Ss. Reading, writing, speaking and listening happen on 
most days represented in the pf and in the CS. 

Configurations are varied — small group, whole class, and pairs, in both 
self-selected and T-selected groupings. 

We also see lessons that build on one another toward completion of a final 
composition. In the pf, those compositions are a character diary and a poem, 
and in the CS the final composition is a revision for on-line publication of a 
S-selected favorite piece of writing. While the goal in each case was to create 
a product, the lessons focused on other important skills. Through handouts in 
the pf and a Power Point presentation in the CS, we see the T scaffold 
students’ learning. She asks questions which aim toward having students 
achieve particular goals, but the questions themselves do not lead students to 
particular answers. In both portraits, we see the T guide her students in 
developing their own critical thinking skills. And, in both the pf and the CS, we 
see the T attend to development of students’ metacognitive skills. 

Also, in both portraits we see established routines that Ss seem to respond 
well to and which seem to align with T goals. In the written pf text, the T tells 
us how students respond to particular activities, and in the video, we see that 
what she says is true. The CS provides additional evidence of the same. Both 
portraits would likely lead to the same evaluation of this T.]] 

Classroom Interaction 

Based on video

In the video conferencing session, we see the T seated at a table with four
students. The procedure is that each S reads his or her poem aloud to the
group, and group members provide feedback. The T facilitates by repeating
and clarifying S responses after each one speaks, and by deciding when to
move on to the next writer. Students are clearly practiced in this type of
session. They make comments such as, “When you say ‘things,’ you could go
deeper,” and “You know how you said it’s confusing? When I went over mine, I
thought the same thing.” A student also defends his poem which two
classmates say is “too deep” with, “That’s what I was aiming for” and
explaining, “I’m trying to stay away from the word ‘like.’” The T makes
comments such as, “What I’m hearing you say is that we have a good poem
here, we just need ...” and she ends the session by telling the students
“I’m very pleased” with the way in which they conferenced.

Based on fieldnotes

[Whole-class interaction in both the pf and the non-pf class is described by
the CSW as representing the same pattern. The following example is one of
several that illustrate that the T poses a question and Ss respond one after
another with a variety of ideas.] < After Ss have read a sample the T has
provided as part of her Power Point lesson, she poses the question, “Does
the entry show deeper thinking?” [A quality they had already determined was
necessary for a good journal entry.] The CSW reports: <Mark attempts to give
an example from the sample selection but his answer is not met with further
agreement. Margaret states that the piece is boring and the teacher does
admit that she feels it’s dull as well. Alex claims that the piece just
tells about a fight with a friend, yet claims the reader really doesn’t know
what the writer is talking about because it’s lacking in feelings. Josh says
that it’s missing elaboration and just doesn’t make sense. Sabrina openly
states, “It is a diary entry and simply that” and the teacher agrees that
the piece is lacking in deeper thinking and asks if it paints a picture. The
class offers a resounding NO and Stella chimes in that it just kept listing
things. George says that the writer got mad and stayed mad without giving
details. Another student states that the piece has no elaboration nor does
it have sensory details. Brittany says she couldn’t feel empathy with the
entry and just read it, not felt it. The teacher then asks the students how
many of them have ever fought with friends and shared similar experiences>
and the class moves on to another topic (CSW, 15).

Comparison: In both the pf and the CS, we observe similar classroom 
routines. We see the first 10 minutes of each class period spent on SSR in 
the pf, notebook writing in the CS. In both portraits, we see the T begin class 
sessions by asking a S volunteer to recap the previous day’s learning, then 
proceeding with the day’s activities. We witness a variety of classroom 
configurations. A horseshoe arrangement of desks is seen on the pf video 
and is described by the CSW. We observe whole class discussion, peer cfs 
about writing, small group activities, T lecture, presentations, and preparation 
of written texts for publication. Class sessions frequently close with the 
assignment of homework or preview of the following day’s activities. 

While on the video the classroom may be seen as quite controlled in terms of 
time management and change from one activity to another by teacher 
command, it is clear in the content of the talk that students are thinking and 
learning. Their questions and comments to peers in writing conferences and 
in whole-group discussions indicate that they have learned to talk with one 
another about ideas, to give one another feedback about writing, and to 
question and compliment their classmates. These are observable in the pf,
and are affirmed by the CSW’s observations of classroom dialogue among Ss
and T.


Teacher's reflection on classroom interaction 


Based on commentary

In order to “keep the noise to a minimum,” the T reports that for the
videotaping, she chose to work with one cooperative group while other Ss
worked individually. When videotaping was finished, students were “released
to work in peer groups of their own” (48).

The T says that “each member of the group eagerly offered his or her
perceptions on the shared work.” She also reports that she questioned one
S’s understanding of the assignment, but otherwise felt that Ss “were fully
in command of the assignment.” She offered her perspective to and moderated
the discussion, but “made a conscious attempt not to influence student
critiques” even when she felt that a S “was being a bit too literal in his
images” (48). Ss have been in revision groups before, but the T notes that
this is the first time they have written and revised poetry this year.

Student Y is one of what the T terms “gray children,” a student she says she
needs to “watch particularly for” and “make a concerted effort to draw into
class discussions and activities,” as they readily “fade into the
background” among more assertive Ss, and “do not cry out for the attention
the lower-ability Ss require” (31-2).

[[What I see on the video jibes with the T’s description of it. The writing
session does feel somewhat rushed. The T did try to “facilitate” the writing
conference group discussion, but she acted somewhat more as “guide” for the
fishbowl activity, seemingly to push the Ss’ thinking. She says that the
first group was nervous, and two of the members we see on camera do seem
very nervous.]]

Based on interview notes

Reflecting upon a day’s lesson, the T says she was surprised by 1st period’s
response: “They aren’t generally that participatory” and she <admitted to
perhaps selling them short as a group. She felt that perhaps the presence of
a stranger motivated them on. In addition, she felt that the lesson ran its
course the way she felt it would and she was disappointed that they didn’t
remember all of the qualities of good writing gone over since the beginning
of the year. However, she recognized that the only quality period one
struggled with was defining honest writing. She quoted, “One out of four
isn’t bad.”>

The T <felt that the students in period two demonstrated typical class
behaviors by showing high levels of participation, enthusiasm for the lesson
content, expressing that this is what she expects from period two. She
further admits to being thrilled with the level of response from period one,
yet states that “they have their on days and off days and you can jiggle the
switch but the light doesn’t go on.” She can’t figure out why this happens
at times stating, “that it can’t be the kids because they’re not low level
kids. They’re average kids.” It’s a different personality and she feels
these children don’t know her as well as the other classes know her as they
only experience her one period a day versus her other two classes which both
have her for two periods in a day. She describes period one as “moving
differently, flowing. They’re far more vested in each other than me as a
teacher”> (CSW, 31).

When the CSW asked the T <to describe her interaction with a boy from period
two who has a gift for analyzing detail and working with words. She
comments, “He has written poetry that would stop your heart but he holds
himself to a standard that is three times higher than any other child. He
worries about things and he is able to grasp things that never even occur to
the other kids. It works against him sometimes and needs a lot of
reassurance and validation. Sometimes he just thinks too much and at times
needs to have limits placed on him because of his drive and tremendous
efforts.” Teacher encourages him “to relax rather than get an ulcer before
the age of twenty.”> (CSW, 33).

Comparison: In the pf, we have information about how the T views individual
Ss as she writes about the 5 writing samples included in her pf. In this CS,
the T also talks a great deal in interviews about individual Ss. In both
cases, she talks both about who the child is as a person (e.g. personality,
affect) and who the child is as a student (e.g. academic strengths and
weaknesses). In both portraits we also see a range; the T provides
information both about students who are the most successful and about
students who struggle the most in her classes.

In both portraits we also see this T’s use of terms like “abstract randoms,”
and “concrete learners,” and she describes herself as “concrete random.” She
appears to shape activities with these “types” in mind, as she does talk
about “types” of Ss who will succeed or struggle with particular activities.

She also describes individual Ss in terms of “high ability,” “lower
ability,” and “average.” There is no indication, however, that she holds Ss
to particular standards based on which of these she believes the S to be.

And, while the T’s characterizations of each class’s personality differ, the
CSW reports that “the teacher has the same basic instructional strategies
for all of her classes” (CSW, FN 9).

T's assessment of student work 

Based on comments written on students' papers

On Student X’s paper, the T circles 3 spelling errors and one missing comma.
At the bottom of the page she writes 4 comments:

Student X, much of this is wording taken almost directly from the book;
that’s not the best tactic here.

Use your own words.

Not clear whose POV you’re taking in this entry . . .

What about this character’s thoughts and feelings? this is mainly summary.

Reflect on the events - don’t just list them!

Based on T's commentary

In reviewing the Ss’ initial character diary entries, the T reveals, “I feel
I may have overestimated the ability of many Ss to take another’s point of
view. The work of Student X in particular reminds me that they are still
young, and I wonder if some Ss have not yet matured enough to see beyond
themselves, to look at things from another’s perspective. Student X’s entry
shows that he is taking the book very much at face value; he either cannot
see (or does not wish to take the effort to see) the character traits of the
characters conveyed in their words or actions” (33).

Based on fieldnotes and interview notes

[[While we see no evaluated S writing in the CS, the T reviewed with Ss the
qualities of good writing before (and during) their reading, selecting,
conferencing, and revising processes for the task of polishing a journal
entry for publication.]]

The CSW writes that the T <starts with discussing the ingredients of a good
notebook entry: deeper thinking, painting a picture with words, the value of
coming up with one’s own ideas, not writing diary type words and striving
for that full page of writing> (FN 45).

[[Two artifacts appended to the pf provide the more specific detail.]] A
slide from the T’s Power Point presentations asks:

Does the entry show “deeper thinking?”

Does the writer “paint a picture with words?”

Is the writing “honest writing?”

Is the topic unique, unusual, or particularly meaningful?

The Nomination Ballot students use to evaluate their own and a classmate’s
work lists the following criteria for judgment in addition to the above:

Entry is clear and focused

Wonderful use of “show, don’t tell.”

Captivating written “voice.”

Comparison: It seems that even though we do not see the T’s actual feedback
on Ss’ work in the CS, the nature of her conversation with them and the
content of the slides and the evaluation sheets Ss are to use with peers
make us confident that she will actually use these standards to evaluate the
final products. So while we know more detailed information, and see the
actual follow-through to final draft only in the pf, there is nothing in the
two portraits that would likely cause this T to be evaluated differently in
regard to what she thinks is important in evaluating S writing.]]

Teacher's reflection on her teaching

The T reports that when she begins this unit another time, she will spend
more time on particular aspects of studying the novel that she felt Ss
needed more time with. For example, she says that next time she will “spend
one day focusing only on the aspects of fantasy before launching into the
hero.” She combined the two in this lesson, and reflects: “it would
certainly have benefited from additional time for student work and
discussion” (L, 7).

The T notes that a vocabulary glossary or a vocabulary activity “in
preparation for the reading” might be useful to Ss, as they asked about
several words as the T read chapter 1 of the novel aloud. The T lists
“prodigious,” “frenzied,” and “exclusive” as examples (L, 10-11).

The T also says she “may have overestimated the ability of many students to
take another’s point of view.” Her conclusion is, “In the future, I should
probably ‘test drive’ the character diary on a short story before applying
it to a longer piece, so as to better assess the abilities of the class to
step into another’s point of view” (33).

<In regards to long range goals, the T reports that this is the first time
she has tried this unit, yet feels she might do the same unit again next
year and states that she will give the introductory portion in one day and
the nomination ballot for homework. She also believes that the selections
regarding best pieces will be done in class ‘to eliminate the consequences
of Ss who come to class unprepared’> (CSW, 33).

< During the three day observation period, this interviewer was able to
witness her creating curriculum from day to day based on her monitoring and
adjusting for student learning first-hand. Her Power Point demonstration on
revision, for instance, was created in response to her feeling that the
students needed it to complete the task at hand> (CSW 29).


GQ 2.1 Describe the ways in which the teacher creates a learning 
environment that provides all Ss with opportunities to develop as 
readers, writers, and thinkers: 

[[In the pf, the T describes “types” of S learners (e.g. concrete, random) in her 
logs, her other written text, and her descriptions of those whose writing 
samples she includes. She speaks in both the pf and the CS about 
accommodations for special needs Ss (e.g. special education), and the CSW 
observes the T’s accommodations for absent Ss. 

In both the pf and the CS, we observe similar classroom routines. We see the 
first 10 minutes of each class period spent on SSR in the pf, notebook writing 
in the CS. In both portraits, we see the T begin class sessions by asking a S 
volunteer to recap the previous day’s learning, then proceeding with the day’s 
activities. We witness a variety of classroom configurations. A horseshoe
arrangement of desks is seen on the pf video and is described by the CSW.
We observe whole class discussion, peer cfs about writing, small group 
activities, T lecture, presentations, and preparation of written texts for 
publication. Class sessions frequently close with the assignment of homework 
or preview of the following day’s activities. 

While on the video the classroom may be seen as quite controlled in terms of 
time management and change from one activity to another by teacher 
command, it is clear in the content of the talk that students are thinking and 
learning. Their questions and comments to peers in writing conferences and 
in whole-group discussions indicate that they have learned to talk with one 
another about ideas, to give one another feedback about writing, and to 
question and compliment their classmates. These are observable in the pf, 
and are affirmed by the CSW’s observations of classroom dialogue among Ss
and T. 

The T remarked in the pf text and in the CS interviews that some of the 
activities she planned may be (or may have been) too difficult for some of the 
Ss. When she anticipated that in advance, she attempted to shape instruction 
accordingly. When she realized Ss’ difficulties during or after class, she altered
her plans for the next day (e.g. created another handout, attended to another 
aspect of the writing process in her Power Point presentation), or she 
indicated how she will alter instruction in the future. In both portraits, the T 
creates activities that challenge Ss, and her questions challenge their thinking 
on handouts and in discussion activities. In addition, we see the T make minor 
alterations in a particular lesson from one class to another, but she indicates 
and the CSW observes that she works with the same lesson plans, 
assignments, and “talk” as she guides S learning. In each individual class, 
however, the T asks questions and pushes discussion based on Ss’
contributions, and thus responds in flexible ways to their learning as a class and
as individuals within that class. 

For this T, the CS reinforces what we learn in the pf. Either portrait alone 
would have given a substantial amount of information in regard to this Guiding 
Question, and neither contains information that would likely lead us to 
evaluate this T differently.]] 

GQ 4.1 Describe how the teacher addresses student learning in
reflection. 

GQ 4.2 Describe how the teacher uses that reflection to inform practice. 

[[This T, in both the pf and the CS, addresses issues of S learning in a variety 
of ways, including addressing learning in terms of the class as a whole and 
the individuals within the class. She reflects on students in general 
(age/developmentally) as well as on what she has learned about her own 
students this year. 

In both the pf and the CS, we see the T reflect on student learning in terms of 
the nature of their contributions to discussion. In the pf, she reflects on S 
learning in terms of their “command” of assignments and of specific tasks,
and in terms of the depth of reflection and the insightfulness shown in their
writing. In the CS, she reflects on Ss’ abilities, their strengths and
weaknesses, and the behavior they exhibit during class sessions. So, while 
we gather some similar and some different information about how the T 
reflects on S learning in each of these portraits, we can see in both that she 
draws upon a range of information that is both accurate and important in 
informing her instruction. None of the differences between the evidence 
presented in the pf and that presented in the CS is likely to be the cause of a 
different evaluation for this T.]] 


The World Wide Web address for the Education Policy Analysis Archives is 

epaa.asu.edu 

Editor: Gene V Glass, Arizona State University 

Production Assistant: Chris Murrell, Arizona State University 

General questions about appropriateness of topics or particular articles 
may be addressed to the Editor, Gene V Glass, glass@asu.edu or reach 
him at College of Education, Arizona State University, Tempe, AZ 
85287-2411. The Commentary Editor is Casey D. Cobb:
casey.cobb@unh.edu. 

EPAA Editorial Board 


Michael W. Apple 
University of Wisconsin 

Greg Camilli 
Rutgers University 

Sherman Dorn 
University of South Florida 

Gustavo E. Fischman 
Arizona State University

Thomas F. Green 
Syracuse University 

Craig B. Howley 

Appalachia Educational Laboratory 

Patricia Fey Jarvis 
Seattle, Washington 

Benjamin Levin 
University of Manitoba 

Les McLean 
University of Toronto 


David C. Berliner 
Arizona State University 

Linda Darling-Hammond 
Stanford University 

Mark E. Fetler 

California Commission on Teacher 
Credentialing 

Richard Garlikov 
Birmingham, Alabama 

Aimee Howley 
Ohio University 

William Hunter 

University of Ontario Institute of 

Technology 

Daniel Kallos 
Umea University 

Thomas Mauhs-Pugh 
Green Mountain College 

Heinrich Mintrop 

University of California, Los Angeles 




Michele Moses 
Arizona State University 

Anthony G. Rud Jr. 
Purdue University 

Michael Scriven 
University of Auckland 

Robert E. Stake 
University of Illinois — UC 

Terrence G. Wiley 
Arizona State University 


Gary Orfield 

Harvard University 

Jay Paredes Scribner 

University of Missouri 

Lorrie A. Shepard 

University of Colorado, Boulder 

Kevin Welner

University of Colorado, Boulder 

John Willinsky 

University of British Columbia 


EPAA Spanish & Portuguese Language Editorial Board 

Associate Editors 

Gustavo E. Fischman 
Arizona State University 
& 

Pablo Gentili 

Laboratorio de Politicas Publicas 
Universidade do Estado do Rio de Janeiro 

Founding Associate Editor for Spanish Language (1998 — 2003) 
Roberto Rodriguez Gomez 
Universidad Nacional Autonoma de Mexico 


Argentina 

• Alejandra Birgin 

Ministerio de Educación, Argentina
Email: abirgin@me.gov.ar 

• Monica Pini 

Universidad Nacional de San Martin, Argentina 
Email: mopinos@hotmail.com

• Mariano Narodowski 

Universidad Torcuato Di Tella, Argentina
Email: 

• Daniel Suarez 

Laboratorio de Politicas Publicas-Universidad de Buenos Aires, 
Argentina 

Email: daniel@lpp-buenosaires.net 

• Marcela Mollis (1998— 2003) 

Universidad de Buenos Aires 

Brasil 

• Gaudencio Frigotto 

Professor da Faculdade de Educação e do Programa de
Pós-Graduação em Educação da Universidade Federal Fluminense,
Brasil 

Email: gfrigotto@globo.com 

• Vanilda Paiva 

Email: vppaiva@terra.com.br

• Lilian do Valle 

Universidade Estadual do Rio de Janeiro, Brasil 
Email: lvalle@infolink.com.br 

• Romualdo Portella do Oliveira 

Universidade de Sao Paulo, Brasil 

Email: romualdo@usp.br 

• Roberto Leher 

Universidade Estadual do Rio de Janeiro, Brasil 
Email: rleher@uol.com.br 

• Dalila Andrade de Oliveira 

Universidade Federal de Minas Gerais, Belo Horizonte, Brasil 
Email: dalila@fae.ufmg.br 

• Nilma Limo Gomes 

Universidade Federal de Minas Gerais, Belo Horizonte 
Email: nilmagomes@uol.com.br 

• Iolanda de Oliveira

Faculdade de Educação da Universidade Federal Fluminense, Brasil
Email: iolanda.eustaquio@globo.com 

• Walter Kohan 

Universidade Estadual do Rio de Janeiro, Brasil 
Email: walterko@uol.com.br 

• Maria Beatriz Luce (1998 — 2003) 

Universidad Federal de Rio Grande do Sul-UFRGS 

• Simon Schwartzman (1998 — 2003) 

American Institutes for Research-Brazil

Canada 

• Daniel Schugurensky 

Ontario Institute for Studies in Education, University of Toronto, Canada 
Email: dschugurensky@oise.utoronto.ca 


Chile 


• Claudio Almonacid Avila 

Universidad Metropolitana de Ciencias de la Educacion, Chile 
Email: caa@rdc.cl 

• Maria Loreto Egana 

Programa Interdisciplinario de Investigacion en Educacion (PIIE), Chile 
Email: legana@academia.cl 


Espana 

• Jose Gimeno Sacristan 

Catedratico en el Departamento de Didactica y Organización Escolar de
la Universidad de Valencia, Espana 
Email: Jose.Gimeno@uv.es 




• Mariano Fernandez Enguita 

Catedratico de Sociología en la Universidad de Salamanca, Espana
Email: enguita@usal.es 

• Miguel Pereira 

Catedratico Universidad de Granada, Espana 
Email: mpereyra@aulae.es 

• Jurjo Torres Santome 
Universidad de A Coruna 

Email: jurjo@udc.es 

• Angel Ignacio Perez Gomez 
Universidad de Malaga 

Email: aiperez@uma.es 

• J. Felix Angulo Rasco (1998 — 2003) 

Universidad de Cadiz 

• Jose Contreras Domingo (1998 — 2003) 

Universitat de Barcelona 

Mexico 

• Hugo Aboites 

Universidad Autonoma Metropolitana-Xochimilco, Mexico 
Email: aavh4435@cueyatl.uam.mx 

• Susan Street 

Centro de Investigaciones y Estudios Superiores en Antropologia Social 
Occidente, Guadalajara, Mexico 
Email: slsn@mail.udg.mx 

• Adrian Acosta 
Universidad de Guadalajara 

Email: adrianacosta@compuserve.com 

• Teresa Bracho 

Centro de Investigación y Docencia Economica-CIDE
Email: bracho disl.cide.mx 

• Alejandro Canales 

Universidad Nacional Autonoma de Mexico 
Email: canalesa@servidor.unam.mx 

• Rollin Kent 

Universidad Autonoma de Puebla. Puebla, Mexico 
Email: rkent@puebla.megared.net.mx 

• Javier Mendoza Rojas (1998 — 2003) 

Universidad Nacional Autonoma de Mexico 

• Humberto Munoz Garcia (1998 — 2003) 

Universidad Nacional Autonoma de Mexico 


Peru 


• Sigfredo Chiroque 

Instituto de Pedagogia Popular, Peru 
Email: pedagogia@chavin.rcp.net.pe 

• Grover Pango 

Coordinador General del Foro Latinoamericano de Políticas Educativas,
Peru 

Email: grover-eduforo@terra.com.pe 




Portugal 

• Antonio Teodoro 

Director da Licenciatura de Ciencias da Educação e do Mestrado
Universidade Lusofona de Humanidades e Tecnologias, Lisboa, 
Portugal 

Email: a.teodoro@netvisao.pt 

USA 

• Pia Lindquist Wong 

California State University, Sacramento, California 
Email: wongp@csus.edu 

• Nelly P. Stromquist 

University of Southern California, Los Angeles, California 
Email: nellystromquist@juno.com 

• Diana Rhoten 

Social Science Research Council, New York, New York 
Email: rhoten@ssrc.org 

• Daniel C. Levy 

University at Albany, SUNY, Albany, New York 
Email: Dlevy@uamail.albany.edu 

• Ursula Casanova 

Arizona State University, Tempe, Arizona 
Email: casanova@asu.edu 

• Erwin Epstein 

Loyola University, Chicago, Illinois 
Email: eepstei@wpo.it.luc.edu 

• Carlos A. Torres 

University of California, Los Angeles 
Email: torres@gseisucla.edu 

• Josue Gonzalez (1998 — 2003) 

Arizona State University, Tempe, Arizona 



EPAA is published by the Education Policy Studies 
Laboratory, Arizona State University 


